Christie Aschwanden writes:

Not sure you will remember, but last fall at our panel at the World Conference of Science Journalists I talked with you and Kristin Sainani about some unconventional statistical methods being used in sports science. I’d been collecting material for a story, and after the meeting I sent the papers to Kristin. She fell down the rabbit hole too, and ended up writing a rebuttal of them, which is just published ahead of print in a sports science journal.

The authors of the work she is critiquing have written a response on their website (titled “The Vindication of Magnitude-Based Inference”), though they seem to have taken it down for revisions at the moment.

I’m attaching the paper that the proponents of this method (“magnitude-based inference”) wrote to justify it. Kristin’s paper is at least the third to critique MBI. Will Hopkins, who is the mastermind behind it, has doubled down. His method seems to be gaining traction. A course that teaches researchers how to use it has been endorsed by the British Association of Sport and Exercise Science and by Exercise & Sports Science Australia.

My reply:

The whole thing seems pretty pointless to me. I agree with Sainani that the paper on MBI does not make sense. But I also disagree with all the people involved in this debate in that I don’t think that “type 1 error rate” has any relevance to sports science, or to science more generally. See for example here and here.

I think scientists should be spending more time collecting good data and reporting their raw results for all to see, and less time trying to come up with methods for extracting a spurious certainty out of noisy data. I think this whole type 1, type 2 error thing is a horrible waste of time which is distracting researchers from the much more important problem of getting high quality measurements.

See here for further discussion of this general point.

**P.S.** These titles are great, no?

I scanned the paper on magnitude-based inference and watched most of a painful video explaining it. Maybe I’m missing something, but it looks like the whole idea boils down to:

1. Yes, there are problems with NHST and confidence intervals.

2. So, let’s introduce a new concept: MBI. We will bring in domain knowledge about meaningful sizes of effects (potentially both beneficial and/or harmful). Notwithstanding all the issues with confidence intervals, we will now unabashedly use those intervals, in conjunction with the domain knowledge, to derive decision rules based on those intervals. Except that we now ignore everything that is incorrect about the interpretation of the confidence intervals.

3. We now use MBI to derive decision rules based on confidence intervals in relation to meaningful beneficial or harmful effects. And, since numbers are hard for people, we’ll couch our decisions in terms of nice terms like “very likely,” “likely,” “possibly,” etc.

I actually appreciate their attempt to resurrect confidence intervals – despite the technically correct critique of confidence intervals, I still support the incorrect interpretation that the probability of the true parameter lying within the X% confidence interval is X%. Yes, it refers to a repeated procedure – but we often have only one sample and we need to use that to make decisions. I’ve said before – I think statisticians that insist on pointing out that confidence intervals are only valid for interpretations of a repeated procedure are being counterproductive – insisting on technical correctness at the risk of making an irrelevant point. I think this MBI idea is evidence that I am right about this. As a reaction to the critiques of NHST and confidence intervals we get nonsense like the MBI.

For example, one of the things I find most useful about confidence intervals is their width. I take that as meaningful evidence about the degree of uncertainty in my evidence. The MBI application appears to completely ignore the width of the interval, choosing to focus on the endpoints of the interval and where they lie in the prespecified regions of harmful and beneficial effects.

As I see it, the truly damaging uses of inference are the attempts to reach dichotomous decisions using a mechanical procedure based on sample data. MBI relaxes this, reaching a slightly larger set of possible decisions – but at the risk of pretending the uncertainty is not really there. Hardly an improvement.

I have similar problems with Mayo’s severity testing – I hesitate to bring it up in the context of MBI as Mayo’s analysis seems to be far sounder and better reasoned than MBI. But I am left with the same feeling – that severity testing is an attempt to save the confidence interval as a useful concept. I believe it is useful. I think the harm is in attempting/insisting on making deterministic decisions on the basis of p-values, confidence intervals, MBI, severity tests, or anything else (to be fair to Mayo, I don’t think she is recommending reducing uncertainty to deterministic decisions).

Why is it so hard to use a confidence interval as meaningful evidence? Yes, surround it with all the caveats you need to (the repeated procedure, in my view, is one of the less important caveats as compared with nonrandom sampling, measurement issues, selection biases, etc.).

Am I missing the point? Is there more to MBI than I am seeing?

Dale.

> I think statisticians that insist on pointing out that confidence intervals are only valid for interpretations of a repeated procedure are being counterproductive.

OK, what would you suggest are other valid interpretations of a repeated procedure?

Some have been previously mentioned/discussed on this blog, but do you have preferences on those or alternatives?

I think you have identified the essence of MBI’s appeal. It takes the troublesome and counter-intuitive concept of a confidence interval, and, by sleight of hand, pretends it is actually the much more intuitive Bayesian posterior. Then it adds another layer on top of that, which covers up the fundamental sleight of hand.

MBI’s appeal comes from this underlying sleight of hand. People are born Bayesians and find it almost impossible to think in frequentist terms. Outside of trained statisticians, I have found that nobody (literally) thinks in frequentist terms, and it is almost impossible to explain the frequentist interpretation to people who have not taken a number of statistical courses. (One statistical course is usually not enough.) Even well-trained professionals often backslide into Bayesian interpretations of frequentist results.

Such is life.

This seems right – I had heard of MBI before but hadn’t taken much notice of it, but was motivated to have a look. It seems to boil down to pretending your frequentist confidence interval is a Bayesian credible interval (plus avoiding use of numbers as far as possible).

There were some papers advocating this sort of approach in medical research about 20 years ago – these were two (if I remember correctly):

https://www.ncbi.nlm.nih.gov/pubmed/11343760

https://jech.bmj.com/content/jech/52/5/318.full.pdf

It didn’t catch on, but maybe the inspiration for MBI came from there?

It is not unusual for a confidence interval to coincide (sometimes exactly, sometimes closely) with a Bayesian credible interval based on some sort of reference prior. This coincidence can be of use.
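One case where the coincidence is exact can be checked numerically. The sketch below is an illustration under stated assumptions, not anything taken from the MBI papers: it assumes normally distributed data with a known standard deviation, a flat prior on the mean, and made-up measurements. The function names are hypothetical. It builds the posterior on a grid and compares its central 95% interval with the usual z-interval.

```python
import math
from statistics import NormalDist

def z_confidence_interval(data, sigma, level=0.95):
    """Classical z-interval for a normal mean when sigma is known."""
    n = len(data)
    centre = sum(data) / n
    half_width = NormalDist().inv_cdf(0.5 + level / 2) * sigma / math.sqrt(n)
    return centre - half_width, centre + half_width

def flat_prior_credible_interval(data, sigma, level=0.95, grid_points=20001):
    """Central credible interval from a grid posterior under a flat prior."""
    n = len(data)
    centre = sum(data) / n
    std_error = sigma / math.sqrt(n)
    low = centre - 10 * std_error
    step = 20 * std_error / (grid_points - 1)
    grid = [low + i * step for i in range(grid_points)]
    # With a flat prior the posterior is proportional to the likelihood.
    log_post = [-sum((x - mu) ** 2 for x in data) / (2 * sigma ** 2) for mu in grid]
    peak = max(log_post)
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    tail = (1 - level) / 2
    cumulative, lower, upper = 0.0, None, None
    # Walk the grid and record where the cumulative posterior crosses each tail.
    for mu, w in zip(grid, weights):
        cumulative += w / total
        if lower is None and cumulative >= tail:
            lower = mu
        if cumulative >= 1 - tail:
            upper = mu
            break
    return lower, upper

# Hypothetical measurements, invented for illustration only.
sample = [0.8, 1.9, 1.1, 2.4, 0.3, 1.6, 1.2, 0.9]
print(z_confidence_interval(sample, sigma=1.0))
print(flat_prior_credible_interval(sample, sigma=1.0))
```

The two intervals agree up to the grid resolution. With an informative prior, or a model in which the flat prior is not the natural reference prior, they would generally differ.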

It’s fine to state that 95% of intervals constructed like this from random samples will contain the true mean. I’m not sure any alternative is better for a correct statement. But I then move to: we have this one sample. What is the probability that the true mean for the population lies within this particular confidence interval (constructed from this one sample)? I’ll go with 95%. And, I would add, provided that………………..
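The repeated-procedure statement in the first sentence is easy to check by simulation. This is a minimal sketch assuming normal data with a known standard deviation; the function name, sample size, and seed are arbitrary choices for illustration.

```python
import math
import random
from statistics import NormalDist

def simulated_coverage(true_mean=0.0, sigma=1.0, n=20, level=0.95,
                       trials=10000, seed=1):
    """Fraction of z-intervals, over repeated samples, that contain true_mean."""
    rng = random.Random(seed)
    half_width = NormalDist().inv_cdf(0.5 + level / 2) * sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        # Draw a fresh sample and build the interval around its mean.
        sample_mean = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / trials

print(simulated_coverage())
```

The printed fraction should sit close to 0.95. That number describes the procedure across many samples; whether it can be attached to the one interval in hand is exactly the point under dispute in this thread.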

I know it’s wrong, but I think the nonsense like this MBI is a reaction to the fact that we want to say something based on our one sample. Unfortunately, if I am understanding the MBI correctly, it is being used to say things based on the confidence interval that should not be said – like a treatment is very likely beneficial if the lower end of the 95% interval lies in an ambiguous range while the upper end lies in the beneficial range.

“What is the probability that the true mean for the population lies within this particular confidence interval (constructed from this one sample)? I’ll go with 95%.”

What exactly will you do with this number? I mean this absolutely seriously. Can you list some productive calculations or computations that you can use this number for?

Daniel,

That is a question I was going to ask myself, but I thought that coming from me it would be viewed as naive. But from you, well, it will be taken seriously.

I’m not taking that bait. I believe almost all of the problems with the confidence interval result from answers to your question. Until human beings can live with uncertainty, the statement will be abused. I succumb to it as well – 95% becomes 100% as soon as we try to use it. But rather than tell me that my confidence interval tells me nothing, why not say it tells me what I stated and then focus on the real issue: just what does this one piece of evidence provide, is it sufficient for the decision at hand, and do I understand the consequences of the decision and the risks involved? And, is it feasible to get more/better evidence before I decide?

Dale:

Instead of “confidence interval,” I’d rather say “uncertainty interval.”

I actually now prefer “compatibility interval” at least initially in the inference process that with enough credibility of the assumptions leads to “uncertainty intervals”.

Confidence intervals start out as compatibility intervals, a set of parameter values not too incompatible with data and data generating process model and only make it to Confidence intervals when the data and assumptions seem beyond reasonable doubt. Credible intervals start out as compatibility intervals, a set of parameter values not too incompatible with data and data generating process model with respect to their initial distribution (prior) and only make it to Credible intervals when the data, assumptions and prior seem beyond reasonable doubt.

Drawing from arguments here – Inferential Statistics as Descriptive Statistics: Amrhein, Trafimow and Greenland https://peerj.com/preprints/26857.pdf

Thank you for the link to that paper – it is very good and helps a great deal.

OK, I was baiting you a bit – let me explain.

In repeated batch lot testing everyone seems fine with the confidence interpretation. On the other hand, in routine diagnostic testing in a population with a well-known prevalence of disease, everyone seems fine with taking posterior probabilities as relevant and even literally.

But when there is the perception of a single case (say the case of the last batch sample done before the production line is destroyed), the first is seen as nonsensical. While when the distribution of parameter values in the assumed prior has no relevance at all to the unknown parameter values one is trying to pin down, some Bayesians (Rubin coined these sage Bayesians) don’t take the resulting posterior literally or even as relevant.

In your example, presumably it’s a flat prior that gives that posterior probability and the idea that almost always, the unknown difference one will be trying to learn about will be of magnitude greater than 10^99999 does not seem relevant.

Now the point of the baiting is that it seems any discussion of something in between seems taboo and though it is occasionally blogged about (e.g. in this blog) and discussed in talks and a few papers – most seem to overlook or neglect these discussions. So that’s primarily why I asked for clarification.

I think (though I might be mistaken) that the Bayesian angle to this is largely a red herring. I like Andrew’s rephrasing of the “uncertainty interval.” My takeaway, at this point, is that any mechanical procedure to use the uncertainty interval for decision making is likely to go astray. In many fields (economics, my own discipline, in particular), most analyses are based on a single sample – usually observational and not random in any sense. But the analyses are to be used to inform decision making. It is mostly like your “first sample” and there are rarely (too rarely) subsequent samples. The prior information is usually debatable (although I would say it is too often ignored entirely). So, the question is: under such circumstances, what can be said from a single study?

I guess I am thinking that there is no mechanistic answer that is satisfying. Perhaps that is much of the problem with NHST – we teach people that their answer is reject/do not reject rather than the reasoning behind the decision they reach – which is not mechanistic (although it may involve a number of calculable probabilities). If there was a mechanistic decision procedure, then we hardly need humans to be involved. If humans are to be involved, then it means the essential parts of the decision involve reasoning – reasoning about the uncertainties, the kinds of mistakes we might make, the probabilities that we can determine, and the potential/need for further studies.

I am interested in any reactions to this, but I do want to return to MBI for a moment. In one sense, it is a step in the direction I am stating – use domain knowledge to establish limits from which to take actions, and then establish probabilities in relation to these limits. On the other hand, it appears to hide the uncertainty in the estimates and tries to establish a mechanistic decision making procedure to resurrect most of the NHST apparatus. Am I understanding that correctly?

I think you’re moving towards the answer I have to my own baiting above ;-)

The fact is you can’t *make decisions mechanically* because decisions ultimately need to be about usefulness/utility, and no robot can decide for you or any group of people, what the utility should be.

However, once some kind of utility has been decided on, you *can* calculate mechanically, and *this* is what Bayes does. The frequentist probability that 95% of constructed intervals contain the true value just doesn’t let you calculate in the way you need to: relative probabilities *within* the interval can’t be calculated in Frequentism. In particular you’ll get different results with different tests of the same set of hypotheses, etc. To get consistency *within* the interval you’ll need to adopt the likelihood, and then if you insist on flat priors you’ll be doing something that is provably dominated in frequentist terms compared to a real-world prior in all but the most awkwardly constructed fake problems (i.e. problems where parameters of interest have numerical values that really do exceed the limits of an IEEE floating point number, for example).

So I think the answer is clear: teach decision making, teach the importance of utility, and teach Bayes. Drop confidence and NHST because, just like you say, “Perhaps that is much of the problem with NHST – we teach people that their answer is reject/do not reject rather than the reasoning behind the decision they reach”. Instead *teach reasoning about decision making*. NHST is a broken paradigm whose goal was to mechanize discovery in an age when it seemed like mechanizing everything was a good idea (1940’s Dieselpunk ethos).
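To make the "once a utility has been decided on, you can calculate mechanically" step concrete, here is a minimal sketch. Everything in it is an assumption chosen for illustration: the normal posterior for the effect, the utility function that weights harm three times as heavily as benefit, the fixed cost of adopting the treatment, and the function names.

```python
from statistics import NormalDist

def expected_utility(posterior, utility, grid_points=4001, width=8.0):
    """Integrate utility(effect) against the posterior density on a grid."""
    low = posterior.mean - width * posterior.stdev
    step = 2 * width * posterior.stdev / (grid_points - 1)
    total = 0.0
    for i in range(grid_points):
        effect = low + i * step
        total += utility(effect) * posterior.pdf(effect) * step
    return total

def adopt_utility(effect, harm_weight=3.0, cost=0.2):
    """Hypothetical utility of adopting: harm counts harm_weight times as much."""
    gain = effect if effect >= 0 else harm_weight * effect
    return gain - cost

# Hypothetical posterior for the treatment effect.
posterior = NormalDist(mu=0.5, sigma=0.4)
eu_adopt = expected_utility(posterior, adopt_utility)
eu_skip = 0.0  # doing nothing is taken as the zero point of the utility scale
decision = "adopt" if eu_adopt > eu_skip else "skip"
print(eu_adopt, decision)
```

The human judgment lives entirely in `adopt_utility` and in the choice of prior behind the posterior. The integration that follows is the mechanical part, and it needs a full distribution over the effect, which a confidence interval on its own does not supply.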

It seemed to me that what the authors of the magnitude-based inference are doing is just a variant of equivalence/inequivalence testing based on using confidence intervals instead of P-values, for which there is a wealth of statistical literature that they have not referenced (and which has been done in both a frequentist as well as Bayesian paradigm). But the MBI folks alter the confidence level used for different endpoints of the CI depending on tradeoffs in risk, what in equivalence/inequivalence testing is often referred to as consumer versus producer risk. So this seems like a slight twist from the usual equivalence/inequivalence approach I’m familiar with. The difficulty with either the MBI or the equivalence/inequivalence approach is setting the effect sizes that are considered trivial. Often these have to be fixed by regulatory requirements but in many scientific applications there would seem to be lots of opportunity (although rarely done) to create more fuzzy boundaries and investigate those. The basic premise of using the CI associated with equivalence/inequivalence testing to establish whether your data has more or less evidence for an effect of substantive interest seems to me one of the best ways to make the usual frequentist-based CI more relevant to scientific investigations.
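The interval version of equivalence testing described here can be sketched as follows. This is a generic illustration assuming a normal approximation, not the MBI authors' actual rules: the function name, the trivial zone, and the one-sided confidence levels are all hypothetical. With both levels at 0.95 the bounds form the familiar 90% two-sided interval used in two one-sided tests; choosing different levels for the two ends mimics the asymmetric treatment of risk mentioned above.

```python
from statistics import NormalDist

def compare_to_trivial_zone(estimate, std_error, trivial_low, trivial_high,
                            lower_level=0.95, upper_level=0.95):
    """Classify an estimate by where its one-sided bounds fall.

    lower_level and upper_level are one-sided confidence levels (above 0.5),
    and may differ if errors in one direction are considered costlier.
    """
    z = NormalDist().inv_cdf
    lower = estimate - z(lower_level) * std_error
    upper = estimate + z(upper_level) * std_error
    if trivial_low < lower and upper < trivial_high:
        return "equivalent"    # whole interval inside the trivial zone
    if lower > trivial_high:
        return "above"         # whole interval beyond the upper threshold
    if upper < trivial_low:
        return "below"         # whole interval beyond the lower threshold
    return "inconclusive"      # interval straddles a threshold

# Hypothetical numbers: the same estimate under symmetric and asymmetric levels.
print(compare_to_trivial_zone(0.4, 0.2, -0.5, 0.5))
print(compare_to_trivial_zone(0.4, 0.2, -0.5, 0.5, upper_level=0.6))
```

The second call relaxes the level on the upper bound, and the classification changes even though the data are the same. That shows how much the conclusion depends on the chosen levels and on where the trivial-effect thresholds are set.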

Is MBI too complicated to have any realistic chance of being adopted? It adds another layer to the NHST and Bayesian paradigms, and statistics is already pretty confusing to all but the illuminati without that extra layer.

Links are dead

Thanks to everyone for their comments – very much appreciated. Just to note briefly that MBI is not a ‘new’ method and there is nothing magical-mystical about it. Laid bare, it’s a simple ‘objective’ or ‘reference’ Bayesian analysis (for want of better terms) with a dispersed uniform prior (which in many/most cases is congruent with the standard frequentist confidence interval). The interval is interpreted, relatively uncontroversially I believe, as a plausible range of effect sizes compatible with the data and model. Beyond this, posterior probabilities are derived that the ‘true’ population effect is beyond some threshold that is considered clinically or scientifically meaningful, with qualitative descriptors applied to this probability to help with interpretation. Aside from acknowledged debate/critique about the merits or otherwise of the prior assumption, it’s this last part that people seem to be getting particularly upset about recently. However, the Intergovernmental Panel on Climate Change (IPCC) uses a similar scale for helping people to interpret probabilities (1). For example, the IPCC’s claim that it is “extremely likely” that the majority of climate change over the last century is person-made is based on a scale that assigns “extremely likely” to a probability of at least 95%. Note also that the IPCC thresholds are somewhat less stringent than MBI, as they require a probability of >66% to declare a finding “likely”, versus 75% for MBI (https://www.nature.com/articles/nclimate2194/tables/1).
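The calculation described in this comment can be sketched in a few lines. This is an illustration under stated assumptions rather than the MBI spreadsheet itself: it assumes a flat prior and a normal approximation, so the posterior for the effect is centred on the estimate with the standard error as its spread. Only the two “likely” cutoffs quoted above are encoded, treating both as strict lower bounds is an assumption, and the function names and example numbers are hypothetical.

```python
from statistics import NormalDist

def prob_beyond(estimate, std_error, threshold):
    """P(true effect > threshold) under a flat prior and a normal approximation."""
    return 1.0 - NormalDist(mu=estimate, sigma=std_error).cdf(threshold)

# Only the "likely" cutoffs quoted in the comment; the rest of each scale is omitted.
LIKELY_CUTOFF = {"IPCC": 0.66, "MBI": 0.75}

def is_likely(probability, scale):
    """True if the probability clears the 'likely' cutoff of the named scale."""
    return probability > LIKELY_CUTOFF[scale]

# Hypothetical study: estimate 1.0, standard error 1.0, meaningful threshold 0.5.
p = prob_beyond(1.0, 1.0, 0.5)
print(round(p, 3), is_likely(p, "IPCC"), is_likely(p, "MBI"))
```

In this made-up case the probability lands between the two cutoffs, so the same result would be called “likely” on one scale and not on the other. The label therefore depends on the scale chosen as well as on the data and the prior.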

In any event, the thresholds underlying the MBI descriptors of likely, very likely etc. were never intended to be prescriptive or to encourage/ force dichotomous thinking. Rather, in extensive consultation with clinicians and practitioners over the years, they represent a first stab at such a scale. The overriding intention is indeed, to fully and properly embrace uncertainty of estimation. Certainly no ‘sleight of hand’ or lowering of the standards of evidence was ever intended. There is no doubt that – as with all methods – there are examples of misuse of MBI, so the way that MBI is sometimes used in practice is something of a cause for concern. I agree with Andrew, of course, that the main focus should be on better designs, more precise measurement, improved reporting, and raw data sharing in this open science era, rather than sterile debates about methods or ‘rules’ of inference. I also agree that a focus on error rates is a distraction, especially as MBI does not involve hypothesis testing. Nevertheless, we took up this challenge as at some stage someone has to make a decision based on data – though preferably not from a single study – and decisions come with attendant errors. Essentially, our feeling is that not attempting to assist with interpretation of findings to aid decisions affecting policy and practice is ‘kicking the can down the road’, as it were. I am, however, grateful for David Spiegelhalter’s wise counsel, that these are deeply contested issues, no opinions should be considered definitive, and one should be wary of appeals to ‘authority’. Thank you all once again for your interesting feedback – I am genuinely very grateful for the discussion.

1. Mastrandrea MD, Field CB, Stocker TF, Edenhofer O, Ebi KL, Frame DJ, Held H, Kriegler E, Mach KJ, Matschoss PR, Plattner G-K, Yohe GW, Zwiers FW (2010). Guidance Note for Lead Authors of the IPCC Fifth Assessment Report on Consistent Treatment of Uncertainties. Intergovernmental Panel on Climate Change (IPCC): https://www.ipcc.ch/pdf/supporting-material/uncertainty-guidance-note.pdf

Alan:

> one should be wary of appeals to ‘authority’

Agree, so we can disregard the reference to IPCC

> not attempting to assist with interpretation of findings

Also agree – for many of us that should be a main part of our job

> as a plausible range of effect sizes compatible with the data and model

That’s doable without a prior, but with a prior the assessment needs to include compatibility between the prior and background knowledge, and I don’t believe the ‘objective’ or ‘reference’ priors are compatible in that sense. Rather they simply restore the compatibility assessment to being in terms of frequentist properties. For instance, in OBayes publications a big deal is made of obtaining frequency properties that are sometimes even better than those of frequentist-derived confidence intervals.

Some of this is discussed in the linked paper I gave above, in the simulation in Andrew’s post here https://andrewgelman.com/2016/08/22/bayesian-inference-completely-solves-the-multiple-comparisons-problem/ and in Dan Simpson’s post on simulating fake data from default priors and how impossible such data is in reality https://andrewgelman.com/2018/09/12/against-arianism-2-arianism-grande/

The big picture is that attempting to assist with interpretation of findings is an open question without any group being able to convince others widely as to how to go about it.

Hi Keith,

Just to clarify, the IPCC reference wasn’t intended as an appeal to authority. Rather, I was merely pointing out that others have come up with similar descriptors of probabilities to try to help make sense of the data/ findings. Thanks for the notes and citations/ links to posts on objective Bayes and reference priors – I’ll check those out. Agree with your last point about the big picture.

Thanks for your reply.

For your information, there is a discussion on interpretation / communication of findings going on here, with great contributions by Frank Harrell and Sander Greenland:

“Language for communicating frequentist results about treatment effects”

https://discourse.datamethods.org/t/language-for-communicating-frequentist-results-about-treatment-effects/934

Among many other topics, Sander explains why he prefers “compatibility interval” to “confidence interval”:

https://discourse.datamethods.org/t/language-for-communicating-frequentist-results-about-treatment-effects/934/39