Charles Jackson writes:

The attached item from JAMA, which I came across in my doctor’s waiting room, contains the statements:

Nineteen of 203 patients treated with statins and 10 of 217 patients treated with placebo met the study definition of myalgia (9.4% vs 4.6%; P = .054). This finding did not reach statistical significance, but it indicates a 94.6% probability that statins were responsible for the symptoms.

Disregarding the statistical issues involving NHST and posterior probability, I assume that the author means to say “. . . that statins were responsible for the symptoms in about half of the treated patients.” I doubt if statins caused the myalgia in the untreated patients.

Yup. Statistics is hard, like basketball, or knitting. Even JAMA editors can get these things horribly wrong.

I just couldn’t believe that JAMA would publish such nonsense, especially in a Viewpoint discussion piece, so I went and looked. Well, they did.

The good news is that someone(s) pointed out the “common misinterpretation” http://jamanetwork.com/journals/jama/article-abstract/2610328 and the online version has been “corrected” to leave out the “94.6% probability” claim. The bad news is that the original Viewpoint was authored by a chief of cardiology at a prominent institute who has 400+ publications. In reviewing a few of them I see that he’s a big fan of evidence based medicine and has chaired some high powered expert panels tasked with assessing the literature on certain drug-induced muscle related adverse effects. Apparently even the high and mighty (and those tasked with deciding what’s best for us health-wise) are at risk of stepping into one of Stat’s many subtle snares.

I wouldn’t say *even*, here. They are “high and mighty” because misinterpreting stats leads to more publications, which is the usual measure of productivity.

+1

The problem is incredibly widespread in medicine. Most practitioners and academics fall into the misconceptions that are oh so tempting. Only two types avoid the temptation.

1) The extraordinarily numerate (very few people)

2) Those completely disinterested in data (the storytelling folk), but even they fall victim to the stories being told by those who succumbed to temptation.

Gerd Gigerenzer describes a SIC Syndrome amongst doctors (Self defence, Innumeracy, Conflict of interests). The sad thing is that so often in medicine there is no data to illuminate a particular question and then when there is data, it simply becomes raw material for the SIC.

Before I read the comments, I too went to find the original article and was at first confused by the discrepancy. Interestingly, there is a digital artifact. If you paste in the quote as Jackson has it in his comment above, Google Scholar finds it verbatim. But then when you retrieve the actual article the offending clause is gone. There is a statement of correction as follows: “This Viewpoint was corrected December 12, 2016, to change language regarding the STOMP trial”, which is pretty disingenuous.

I’m just contributing this comment to highlight the concern that it is hard to completely correct mistakes in printed materials these days, making even understandable errors substantially problematic.

Jackson’s correction is also incorrect, ie, the finding does not mean, “a 94.6% probability that . . . that statins were responsible for the symptoms in about half of the treated patients.” That’s an interpretation of the point estimates. The p-value means that if the population proportions were actually identical, the sample proportions would differ by less than the observed difference with probability 94.6%.

Chris:

I don’t think Jackson was trying to offer a correction. He was saying “I assume that the author means to say,” meaning that the author’s statement was wrong even beyond the usual way such statements are mathematically incorrect, in that it made no sense even from a purely medical perspective.

I think that’s covered in the “disregarding the statistical issues involving NHST and posterior probability”.

In my experience, “indicates” is one of those weasel words that is often a clue* that the writer/speaker doesn’t really understand what they are talking about.

*Full disclosure: I initially wrote “often indicates” instead of “is often a clue”.

Too bad — you could have used ‘often indicates*’ with a footnote reading “doesn’t count this time”.

To me, this illustrates the most damaging consequence of the p-value’s ubiquity: the widespread belief that (1 – p) = the probability my hypothesis is correct. I remember thinking of p-values this way when I first started grad school. I’ve talked to a number of friends who use NHST in their own work and who think of p-values this way. I know that my students’ natural inclination is to think of them this way, despite my best efforts.

The problem isn’t just that this misinterpretation is so widespread. It’s that this misinterpretation makes the p-value seem like it has vastly more evidential value than it really does. I’ve come to believe that most people who use p-values honestly believe that p = 0.05 means “there’s a 95% chance my hypothesis is correct”. Or maybe they know that this isn’t *technically* correct, but they feel it’s close enough. It is really, really hard to convince casual users of p-values that their intuitive understanding isn’t just a little bit wrong, but dramatically and grotesquely wrong.

I think this problem needs to be distinguished from the problems of researcher degrees of freedom, straw-man hypothesis testing, and publication bias. Those things could still exist if we replaced dependence on p-values with dependence on d- or r- family effect sizes, likelihood ratios, Bayes factors, or any other statistical summary that can be compared to a threshold and thus used in a cynical manner. P-values are special in that they aren’t just used cynically, but they are also nearly universally misinterpreted to mean “the probability that I’m wrong”. And this makes them especially dangerous.

Ben:

Yes, I agree. Related is the last sentence of this article.

Here’s yet another new example, recently featured on Deborah Mayo’s blog:

http://www.slate.com/articles/health_and_science/science/2017/08/how_will_changing_the_p_value_threshold_affect_the_reproducibility_crisis.html

“P-values are tricky business, but here’s the basics on how they work: Let’s say I’m conducting a drug trial, and I want to know if people who take drug A are more likely to go deaf than if they take drug B. I’ll state that my hypothesis is “drugs A and B are equally likely to make someone go deaf,” administer the drugs, and collect the data. The data will show me the number of people who went deaf on drugs A and B, and the p-value will give me an indication of how likely it is that the difference in deafness was due to random chance rather than the drugs. If the p-value is lower than 0.05, it means that the chance this happened randomly is very small—it’s a 5 percent chance of happening, meaning it would only occur 1 out of 20 times if there wasn’t a difference between the drugs. If the threshold is lowered to 0.005 for something to be considered significant, it would mean that the chances of it happening without a meaningful difference between the treatments would be just 1 in 200.”

And this is from someone who’s been researching the topic enough to write an otherwise pretty good article on it. But he still can’t distinguish between p(data|null) and p(null|data) – or he thinks they’re close enough that they can be used interchangeably.
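For a concrete sense of the gap between p(data|null) and p(null|data), a back-of-the-envelope calculation helps (the base rate and power below are invented purely for illustration):

```python
# Toy screening of 1000 hypotheses: suppose only 100 reflect real effects,
# and the studies have 50% power. (Both numbers are made up for illustration.)
alpha, power = 0.05, 0.50
n_total, n_real = 1000, 100
n_null = n_total - n_real

false_positives = alpha * n_null   # true nulls that still give p < .05
true_positives = power * n_real    # real effects that are detected

# P(significant | null) is alpha = 5% by construction...
p_sig_given_null = false_positives / n_null
# ...but P(null | significant) is a different quantity entirely:
p_null_given_sig = false_positives / (false_positives + true_positives)

print(round(p_sig_given_null, 3))  # 0.05
print(round(p_null_given_sig, 3))  # 0.474
```

So under these hypothetical conditions, nearly half of the “significant” results come from true nulls, even though every individual test held its 5% error rate.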

Hey—I write for Slate! They should’ve asked me before running that.

Yeah, I thought the same thing. This might be a case of the “I understand easy statistics, just not hard statistics” phenomenon that also leads people who are “just” doing t-tests and one-way ANOVAs to never talk to a statistician (and so never learn about any deeper errors regarding their data / methods). They know better than to try to use a GLMM with a zero-inflated response and correlated errors by themselves, but if it’s just something simple like pressing the t-test button in SPSS or making up an easy to understand example to show what a p-value is…

Following with interest, but can I ask for assistance? I soon need to lead a brief workshop for non-statisticians on interpreting statistical evidence in a particular area of behavioral science. They are far more interested in the subject matter than the stats — does anyone have a good, brief, layperson-accessible reference on correct (or at least skillful) interpretation of p-values?

I need something simple enough to present in no more than 10 or 15 minutes, and I have yet to find anything that is accurate, accessible to my audience, and appropriately brief.

No, this doesn’t exist and probably cannot exist at this point. So many misunderstandings need to be unraveled (and each person probably needs a personalized explanation) that it will take much longer.

The first thing is that it is impossible to reconcile the correct understanding with a belief that research based on NHST is reliable (which is almost all modern research). I’ve come to think that if the person still believes the latter they can never understand p-values, since they will simply not believe that so many smart people are wrong.

+1

I agree — 10 to 15 minutes is nowhere near enough time. When giving a continuing ed course, I’ve taken about six hours to do this.

I think one can explain what the p-value *is* in 15 minutes (the probability of obtaining etc, etc). But I see how people could spend hours trying to understand why would they want to calculate *that* instead of something more meaningful.

“I think one can explain what the p-value *is* in 15 minutes (the probability of obtaining etc, etc)”

This assumes that the students know what probability is. And can distinguish between the probability of A given B and the probability of B given A, and lots of other things. Anything that is done in 10 – 15 minutes is bound to be open to misinterpretations.

Don’t they learn probability in high school? Probably they understand at least that a die has six faces and if I toss the die the probability of each outcome 1,2,3,4,5,6 is 1/6.

A) I suspect someone is cheating and using a loaded die which shows a high number (4, 5, or 6) more often than it should (the students will probably agree that, for a fair die, the probability of a high number is 1/2).

To check whether the die is loaded, I define the null “it is not loaded and the probability of getting a high number is 1/2” and I toss the die 8 times. I get high numbers in 7 occasions and only once a low number.

The p-value is 0.035 < 0.05 (the probability of getting 7 or 8 high numbers out of 8, assuming the die is fair).

What is the probability that the die is loaded, given our data? We don't know!

B) The quality control department in a dice factory wants to verify that there are no manufacturing issues. One of the tests is to check whether high numbers appear too often (the faces with low numbers are slightly heavier).

The null hypothesis is, as before, that the probability is 1/2, I toss the die 8 times and the result is 7 high numbers and 1 low number.

The p-value is 0.035 < 0.05 (the probability of getting 7 or 8 high numbers out of 8, assuming the die is fair).

What is the probability that the die we are checking is biased, given our data? We don't know!

C) Uri Geller claims telekinetic powers and we perform an experiment to see if it is true. We will throw a die (perfectly fair as far as we know) 8 times, and record how often he "forces" a high number (4, 5, or 6).

The null hypothesis is, as before, that the probability is 1/2, I toss the die 8 times and the result is 7 high numbers and 1 low number.

The p-value is 0.035 < 0.05 (the probability of getting 7 or 8 high numbers out of 8, assuming the die is fair).

What is the probability that Uri Geller has telekinetic powers, given our data? We don't know!
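The 0.035 shared by all three scenarios is just a binomial tail probability; for instance, in Python:

```python
from math import comb

# P(7 or 8 high rolls out of 8) under the null that P(high) = 1/2.
# The same number applies to the suspected cheat, the factory check,
# and Uri Geller: the p-value sees only the null and the data,
# never the prior plausibility of the alternative.
p_value = sum(comb(8, k) for k in (7, 8)) / 2**8
print(round(p_value, 3))  # 0.035
```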

You need to have a discussion about “what does it mean… the probability that *this* die is loaded?”

From a frequency perspective, it’s simple to say “how often would we get a result more unusual than this if the die is totally fair and we repeat the experiment very very many times” but does this mean something close to “what is the (1-probability) that *this* die is loaded?”

driving home that distinction… not so easy

It doesn’t matter, because the p-value doesn’t have anything to do with loaded dice! :-)

Right, but explaining that to people… that’s the hard part, because we’re talking about overcoming what seems to be a built-in misinterpretation, people are natural bayesians, they want the p value to mean “the probability that the null hypothesis is true” and they want 1-p to mean “the probability that my favorite hypothesis is true” ;-)

Carlos,

Whether or not probability is taught in high school depends on the country (and in the U.S., on the state or even the individual school district).

Your examples are nice ones, but need to be followed up by (and connected to) examples (such as those in most scientific research) which are not as straightforward as your examples.

Of course one cannot make miracles in 15 minutes. But I think one has to use a simple example, and not just because describing a complex example takes longer. A real-world example gives too many opportunities to get lost in the details.

A toy problem is enough to point at the obvious issues with p-values like depending only on the null hypothesis (and not on the hypothesis of interest) and not including any information about how likely is the null hypothesis to be true ex-ante (let alone the alternative).

One would intuitively expect the probability of a loaded die given the data in the examples above to have the ordering A > B > C (cheating suspected > routine check > telekinesis impossible). The p-value is the same, though. “Rejecting the null” equally in the three cases doesn’t make much sense, and for the telekinesis example even if we “reject the null” we shouldn’t accept the alternative at face value.

I agree that people have an amazing capability to misinterpret p-values. But that’s mostly a problem with people, not a problem with p-values :-)


Carlos said,

“I agree that people have an amazing capability to misinterpret p-values. But that’s mostly a problem with people, not a problem with p-values :-)”

But if we are talking about teaching, we are talking about people. So giving an explanation is not the same as people understanding it well enough to use it correctly — and the latter is the goal.

Do they actually do this? I found some info where they do not describe such testing, instead they rely on careful control of the materials and manufacturing process:

http://www.madehow.com/Volume-4/Dice.html

Another source missing a description of NHST:

http://www.midwestgamesupply.com/dice_manufacturing.htm

Here, dice from two manufacturers are tested and it is found that *neither* passes the “randomness test” (and probably none exist that do):

http://www.awesomedice.com/blog/353/d20-dice-randomness-test-chessex-vs-gamescience/

The great thing at that last link is that they come up with a theory to explain the distribution of rolls for one of the dice (not just why there should be a significant p-value), and… do a direct replication study! That is so much better science than most of what is getting published these days.

tl;dr: NHST is not even useful for checking dice.

* Step 1: Every measurement from a sample leaves some uncertainty about the real truth in the population. So for whatever is measured, you can calculate a corresponding margin of error and confidence interval.

* Step 2: All that is meant by ‘statistically significant’ is that whatever you have defined to mean ‘no change / no effect’ is not inside the confidence interval.

Here are two figures which can help:

https://www.dropbox.com/s/0b73r2np7690nz5/Figure6-3.png?dl=0

https://www.dropbox.com/s/ky92u7l0t7ng6w5/Figure6-4.png?dl=0

There are lots of roads to good inference, but talking about margins of error and confidence intervals is one road almost everyone can travel quite easily. Once someone is comfortable with that, explaining what a p value means in terms of within/without the confidence interval can give folks a pretty decent understanding of what the p value is actually measuring.

To my mind, learning about these issues at a broad conceptual level is good enough for most people (they may never learn, or even wonder, how the margin of error is calculated). The key *first* step is to help those who might end up using/consuming stats to understand the basics so that they can focus on what will really matter: establishing reliable and valid measures, making good comparisons, replicating/repeating their work to check if the conclusions they have been drawing hold up. It can also be helpful to point them to Dr. Gelman’s work showing that the real margin of error is often bigger than what is mathematically predicted for an ideal case–so in practice folks might want to mentally double the width of the confidence intervals when making their interpretations/plans for the future.
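The within/without-the-interval reading of ‘statistically significant’ can be sketched with a toy z-test (the estimate and standard error below are made up; 1.96 is the usual 95% normal critical value):

```python
import math

def z_test_and_ci(est, se, null=0.0, z_crit=1.96):
    """Two-sided z-test p-value and 95% CI for a single estimate."""
    z = (est - null) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return p, (est - z_crit * se, est + z_crit * se)

p, (lo, hi) = z_test_and_ci(est=2.1, se=1.0)

# p < .05 exactly when the null value 0 falls outside the 95% CI:
print(round(p, 3), lo <= 0.0 <= hi)  # 0.036 False
```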

I’m sure others will say the same, but all you have done is moved the burden to explaining confidence intervals. The people will want to interpret these as credible intervals, which is wrong, but sometimes they can be approximately the same…

So what is a confidence interval in your explanation?

+1

First, most people have an intuitive sense of what a confidence interval is, which is essentially “a range of plausible/credible values for the truth”.

That intuitive conception is *not* strictly correct, and many other commonly-adopted interpretations of frequentist-based confidence intervals are also technically incorrect (as extensively documented, for example, by Richard Morey). But I would argue that in general the misconceptions related to frequentist CIs are relatively benign compared to those related to p values. For someone interpreting data on a question about which they have relatively little prior information, I don’t think they can go too far wrong with thoughtfully interpreting a CI (assuming, of course, the measurements are reasonable, etc.). If you have examples where that’s not the case, I’d love to learn about them.

For those who want to dig a bit deeper into where CIs come from, simulations can be really helpful (like this one: http://rpsychologist.com/new-d3-js-visualization-interpreting-confidence-intervals )

Finally, credible intervals are great. If someone wants to start there, or learn both approaches, that’s awesome. My point is that for those who think they want to know about p values, understanding them in terms of frequentist CIs can a) give them a much better sense of what ‘statistically significant’ means, and b) give them the tools to convert p values to CIs, to hopefully be more thoughtful and less dichotomous in their thinking about the data obtained.

So do you teach that the “true” parameter has a 95% probability of being inside the 95% confidence interval, or no?

“pretty decent understanding of what the p value is actually measuring”

Which they will promptly forget after the exam, and will fuzzily recall as being something, more or less, like the inverse probability fallacy.

+1

+1 Kyle.

I get to see this in action – I teach a two semester “intro to stats” sequence and I get most of the same students back to back. On the first exam of the 2nd semester I ask them what a p-value quantifies. ~75% of them state some version of the inverse probability fallacy, even though I had covered this specifically and warned them that it would be on the 1st semester’s final exam.

I’ve found that if I focus heavily on addressing one version of the fallacy (e.g. “The probability that the null is true”), they’ll just switch over to another one without realizing it means the same thing (e.g. “The probability my results were just due to chance”). By the end of the 2nd semester I have all but beaten this out of them, but who knows how long that lasts. At a minimum, I don’t think it is possible to teach someone what a p-value is without dedicating a large amount of time to what a p-value isn’t.

I really like Bob’s advice in his final paragraph. This stuff needs to be emphasized in an intro stats class. But how to address p-values? It’s unfortunate that we are left with the choices of 1) don’t cover p-values and leave students ignorant of the most dominant popular measure of “statistical evidence”, 2) cover p-values the traditional way and leave students with all sorts of dangerous misunderstandings, 3) cover p-values and take a critical approach, which some students will interpret as “my stats instructor told me statistics is BS”.

Thanks, Ben. Can’t you tell your students, “This is what it means,* no more, no less, there is no correct paraphrase, if anyone says anything different, they are wrong. If you give the inverse probability or ‘due to chance’ definition in this class, you will fail.”?

* I may butcher the definition myself, but something like, the probability of seeing a result that is this far from the null, if and only if our statistical model of the phenomenon were precisely correct and the null were precisely true, neither of which we ever believe.

Well, of course, then the obvious question is “if that’s the only valid meaning, why do I care at all?” which leads back to Ben’s option (3).

Haha, yes. I could do that! I should probably also take on: “Despite the fact that everything about the way you see this method being used in practice will point you toward believing the inverse probability definition that you find so intuitive, you must resist this. It doesn’t matter that the whole act of using an estimated probability to declare things significant or not significant would make far more sense were your intuition correct. Your intuition is not correct. If this confuses you, embrace the confusion.”

+1 Jason. The better we explain this, the more our students will understandably wonder why everyone is using it as the principal means of establishing “statistical evidence”.

Kyle: Still not definitive, but adding some things you omitted:

“the probability of seeing a result that is at least this far from the null (as measured by the same variable, with a sample of the same size as used in the study), provided the statistical model of the phenomenon is correct and the null hypothesis is true (bearing in mind that the latter two conditions are rarely if ever met).”

Bob,

The picture in your first link says something like “If your value is here, don’t reject.” That’s the kind of rule that avoids understanding and can lead to poor use of statistics.

Yep, agreed. But the question was how to explain p values, so of course harmful dichotomous thinking was going to come up. I’d rather not teach them at all, not only to avoid dichotomous thinking but also because the comment from Ben is pretty spot on: it’s mostly a fool’s errand; most students just won’t be able to resist gravitating towards erroneous conceptions of what p means.

The only thing a p value tells you is whether you could adequately explain your data as coming out of a given RNG.

If the answer is no… then you know it isn’t adequately explained by that RNG, and nothing else.

What if the question is whether the misfit could be adequately explained by a RNG? That is, whether the misfit between a (not necessarily statistical) model and the data ‘looks like noise’. Does that still seem so unreasonable?

Not clear what you’re trying to ask. What do you mean by “misfit between a (not necessarily statistical) model and the data” and “looks like noise”? A specific example might help clarify.

Yes, that’s what I meant, since rarely is a real RNG at work; it’s just that an RNG is adequate. This is why failing to reject is usually more informative. It means you could pretend an RNG is at work without self-contradiction.

Note this isn’t the same as proving an RNG is at work, which is probably the common misinterpretation when failing to reject.

Yeah I agree, failing to reject is usually more interesting. And yes, doesn’t mean the pattern ‘really’ is from a RNG, since it isn’t (unless you count quantum mechanics or whatever) but that it can be treated as such. You have an adequate model. There may be many others too.

Martha: Draw a straight line through some data. This is not a statistical model, except for the implicit acknowledgement that you don’t have to fit every point (even then this could be thought of as an approximation rather than anything statistical).

Then, to bring in statistics, you could ask ‘do the residuals look like noise?’ and do a formal hypothesis test (if you wanted). Failure to reject can be interpreted as saying your straight line model isn’t ignoring any obvious statistical patterns.

“Do the residuals look like noise?” still doesn’t answer the question of what “looks like noise” means.

Right, the more explicit statement is “do the residuals look like noise out of this specific RNG noise process?” But this can often be a perfectly reasonable question to ask; the key is to teach people that there’s no such thing as generic “noise”: there’s an infinity of possible individual kinds of noise.

Martha – sure, I agree. One way to get a feel for possible noise patterns would be to bootstrap the residuals. Or whatever.

Basic point is that a hypothesis test can make sense in this context.

In fact, saying that there are many possible noise patterns is very true and also an argument against adopting any one true model and working within it, and in favour of eg EDA.

I’ve still got a gripe with “Do the residuals look like noise?”

I use the definition, “Data are numbers in context.” So anything that just depends on the numbers (not taking the context into account) is ignoring part of what constitutes the data. Taking context into account, noise could be of various sorts. For example, in talking about brain scans, “noise” could refer to variability between individual brains. (Andrew often uses this meaning for “noise”). It could also refer to measurement inaccuracies (because the measure is not perfect — again, Andrew often talks about this kind of noise), or also to external things that influence the measurement (e.g., literal noise, or vibration, etc. in the case of brain scans.)

I agree that context is important.

The phrase ‘do the residuals look like noise’ is not to be interpreted sans context. It’s shorthand for ‘am I missing anything’. What you expect noise to look like depends on context, judgement etc.

I was just saying that you could translate this judgement into a hypothesis test and in this context- analysing misfit – it seems to make some sense to me.

Compare misfit to fake, simulated noise and see if you can tell the difference. If you can’t, you have little justification for adopting a more complex (or whatever) model.

Yes, I’ve made this point somewhere too. In the conventional view of things, you learn from a rejection (p less than 0.05 or whatever) but you don’t learn anything from a non-rejection (as you can never accept the null hypothesis). But I think that, in the overwhelming majority of cases, it’s the other way around: rejection is uninteresting as all it tells you is that your data did not come from a specific random number generator that you never believed anyway, whereas non-rejection is informative, as it tells you that this aspect of the data *can* be explained by a random number generator, which obviously limits what you can possibly say in that case.

I see the RNG playing the role of statistical zero: you don’t expect the misfit of an adequate model to be exactly zero but to be indistinguishable from ‘statistical zero’. To model something you need more than just zero – you need the actual model to check to!

…to0 (forgot a zero)

Kristin, I agree with everyone else that 10 to 15 minutes isn’t enough time to give anything close to a satisfactory treatment on p-values. But I think that taking the time you have to deliver some strong words of warning would be better than nothing at all, and definitely better than a typical “quick intro to p-values” that will only leave students confused or believing false things. I’ve set aside time in my own introductory courses to directly address the ways in which p-values are misused and misinterpreted, and there are a couple of quick and dramatic examples that may be effective. It would help a lot if your audience was already familiar with p-values to some extent, so that you can go directly to addressing common misconceptions and their consequences.

– Define what a p-value is, followed by what it isn’t (“the probability my results were due to chance”, “the probability the null is true”, “the probability my hypothesis is wrong”, “1 minus the probability my hypothesis is correct”). Write the distinction out as P(data|null) vs. P(null|data), where “data” is shorthand for “t(data)>critical value”. Then show examples where confusing these two things leads to dramatically different values – Cohen has a good one in his paper “The Earth is Round”. I like the example of a diagnostic test that has a false positive rate of 1 in 100, and for which the incidence of the condition is also 1 in 100. In this example we have P(positive|no condition) = 0.01 and P(no condition|positive) roughly equals 0.5. The p-value is analogous to the former and the way most people think of the p-value is analogous to the latter.

I like this example because it’s relatively easy to understand, and probably satisfies your requirement of “appropriately brief”. It is a flawed analogy in that it refers to a discrete outcome, whereas most NHST procedures are used on continuous outcomes and so you have the additional major problem that the point null (theta = 0) is impossible. But you aren’t going to address this in a 15 minute presentation.

– When covering the assumptions upon which the p-value depends, I go beyond the usual textbook stuff (e.g. X is iid normal) and emphasize that interpreting the p-value requires you to assume that, had your data been different, you still would have performed the *identical* analysis. This is effective because I think most people who have used p-values in their own work know that it is rarely true. They may rationalize their data dependent decisions (e.g. “this subject didn’t follow protocol”, “if I hadn’t checked the p-value before collecting more data I still would have gotten the same number”, “that outlier reflects a special situation that is beyond the scope of our research question”), but if they know that the p-value requires them to claim that they would have done everything exactly the same way with a different data set, then they might be persuaded to see the p-value as more fragile than they currently see it.

– On a similar note, I think that demonstrations of the noisiness of p-values can be effective and don’t take much time. Geoff Cumming has some good videos on this (e.g. https://www.youtube.com/watch?v=OcJImS16jR4). His “dance of the p-values” demonstration is easy to grasp and will likely surprise most people who have used p-values but don’t understand them well.
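The diagnostic-test analogy from the first bullet can be checked by simple counting (I’m assuming here, since the analogy doesn’t say, that the test catches every true case):

```python
# 1-in-100 incidence, 1-in-100 false positive rate, perfect sensitivity (assumed).
population = 10_000
sick = population // 100            # 100 people actually have the condition
healthy = population - sick         # 9900 do not

true_pos = sick                     # every sick person tests positive (assumption)
false_pos = healthy // 100          # 99 healthy people test positive anyway

p_pos_given_healthy = false_pos / healthy                 # like the p-value
p_healthy_given_pos = false_pos / (false_pos + true_pos)  # like the misreading

print(p_pos_given_healthy, round(p_healthy_given_pos, 2))  # 0.01 0.5
```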

This paper has a long list of common misinterpretations that you could include:

“Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations”, Greenland et al (2016)

If it were me I’d include a statement to the effect of “In many fields of research there are strong incentives for researchers to use p-values and to treat obtaining p<0.05 as their goal. The negative byproducts of this include large amounts of non-replicable findings and exaggerated claims entering the published literature, and researchers employing unsound methods in a cynical but highly incentivized quest to achieve a statistical outcome that neither they nor most of their peers truly understand."

This is an admittedly negative approach, but with 15 minutes I'd be hitting them hard with p-value criticism. The more positive / neutral approach in which we describe what a p-value is and how it is commonly used is what gets taught in standard intro stat and research methods classes, and the result has been that nearly everyone believes false and dangerous things about p-values. Unless a critical approach is taken, students are guaranteed to come away from these classes thinking things like "the p-value tells me the probability that my hypothesis is wrong", or "claims based on statistically significant outcomes are only wrong 5% of the time".

Some good suggestions here, but what you list would take more than 15 minutes — and would need to include things like the distinction between P(data|null) and P(null|data), a confusion that a lot of people have trouble avoiding.

Yes, I agree. One might be able to pull off one of these things for an audience that was already familiar with p-values, but it would be rushed.

Maybe then the best approach would be to openly state that this is a subject that requires far more time than is available, but to give you a sense of how fraught with danger it is, here are some quotes from Gelman, Cohen, Meehl, and the ASA…

And then end with the classic “if you think you understand quantum mechanics, you don’t understand quantum mechanics”, with “quantum mechanics” crossed out and “p-values” written in.

That would be a pretty nihilistic take on “interpreting statistical evidence”, but perhaps appropriate given the time constraint.

I would NOT give an uncritical “here’s what p-values are in 15 minutes” talk. Everyone would come away either feeling confused, or even worse feeling like they understood it.

I do understand the indignation. Would the following have attracted the same ire?

Nineteen of 203 patients treated with statins and 10 of 217 patients treated with placebo met the study definition of myalgia (9.4% vs 4.6%. P = .054). This finding did not reach statistical significance, but simulating from a beta-Bernoulli model in Stan with flat priors, the probability of the relative risk for myalgia being greater than 1.0 is 94.3%.

This is at least accurate-ish as a description of the calculation. But flat priors are ridiculous. Myalgia at a rate of say 90%+ of patients has 10% plausibility under those priors… Inappropriate given that far less than 9/10 people I’ve known in my life have myalgia.
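For what it’s worth, the flat-prior number doesn’t need Stan: with a Beta(1,1) prior the posterior is conjugate, so it’s a two-line simulation (the exact figure will wobble by a point or so relative to the 94.3% quoted, depending on sampler and seed):

```python
# Flat-prior (Beta(1,1)) posterior probability that the myalgia rate is
# higher on statins, via conjugate Beta-binomial updating. Monte Carlo,
# so expect the last digit to differ from the Stan figure quoted above.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
pt = rng.beta(1 + 19, 1 + 203 - 19, n)   # posterior, statin arm: 19/203
pc = rng.beta(1 + 10, 1 + 217 - 10, n)   # posterior, placebo arm: 10/217
prob_rr_gt_1 = np.mean(pt > pc)          # pt/pc > 1 iff pt > pc
print(round(prob_rr_gt_1, 3))

# The objection above: a flat prior puts 10% of its mass on myalgia
# rates above 90%.
print(round(np.mean(rng.beta(1, 1, n) > 0.9), 3))
```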

Don’t disagree.

I wonder if there is a probability statement that could have been made here that would not have attracted criticism.

I’m probably in the wrong job given most of us have backache :)

If they’d done something like a beta(2,8) prior, suggesting that maybe around 20% of people have myalgia they could have given a probability that relative risk was greater than 1 of 96.2%, which *is* above the conventional significance threshold (not that I care about this). I’m not sure it would be uncontroversial, but it’d at least be principled and not based on inappropriately broad priors.

Or beta(4,16) or maybe beta(1,10). And let’s reflect the greater prior probability of myalgia in the statin group etc. etc.

The clinician here has an intuition that the probability of the outcome of interest is greater in treatment compared with control group, despite the NHST being above a given threshold. But he doesn’t know how to express it.

Are there many resources suitable for the non-statistician to help him?

I don’t know what the right prior is. But, there is a substantial literature on myalgia and statins. See, for example, http://www.bmj.com/content/345/bmj.e5348.

That 2012 note advises a practitioner dealing with a patient with uncomfortable muscle aches in the arms to “Explain that myalgia with statins is common, affecting 5-10% of patients in clinical trials of statins.”

A flat prior regarding the extent to which treatment with statins induces myalgia is inappropriate—given all the evidence and experience that are available today. If someone did an experiment with N=200 in both the group treated with statins and the control group and 10 people in each group reported myalgia, it would not change my beliefs about statins and myalgia very much.

It seems to me that the right prior regarding the effects of statins has a big lump of probability near 10% with relatively little probability below 5% or above 20%.

Bob

Surely your beliefs about statins and myalgia are limited to referring to some “frame” or “population”, though.

No one is able to predict why some people get myalgia and others don’t, right? So maybe such a study just found a good subgroup, or something changed (some widespread dietary component, e.g. gluten, became unpopular). I don’t think that is any reason to ignore studies like that, especially when we are largely ignorant about why anything is going on in the system.

Bob, just by eye, your prior matches something like beta(12,87) or similar. Under that prior, my simulation

pt = rbeta(1000, 12+19, 87+203-19)  # posterior draws, statin arm (19/203)
pc = rbeta(1000, 12+10, 87+217-10)  # posterior draws, placebo arm (10/217)
sum(pt/pc > 1)/1000                 # fraction of draws with relative risk > 1

gives 93% probability that the relative risk is over 1

If you imagine that the control group here was representative of many control groups for myalgia which you could have looked up info on, you could use something like beta(1,21) as the prior for both groups (diffuse relative to the results we have here on the control)

pt = rbeta(1000, 1+19, 21+203-19)   # posterior draws, statin arm
pc = rbeta(1000, 1+10, 21+217-10)   # posterior draws, placebo arm
sum(pt/pc > 1)/1000                 # fraction of draws with relative risk > 1

This gives a 96% chance that the relative risk is over 1.

However, perhaps a more important question is how much total quality-of-life benefit or cost the treatment has. Let’s presume the benefit of a year without myalgia is 1 and the benefit of a year with myalgia is, say, 0.5. A heart attack from which you survive has a cost of, say, -20 and reduces your benefit afterwards to 0.8 and 0.4 for without and with myalgia respectively; a heart attack that kills you before age 65 is -20, and there is no benefit after death. (Here you can think of these benefits as ratios of two quantities with the same units, like dollars/dollars.) Then we need some model for how long you live under treatment vs control (the assumption is that statins reduce your heart attack risks etc.; this is just a sketch).

Finally, we could come up with something that we actually care about: the benefit/cost to the patient of giving the treatment, by simulating a few thousand lifespans, one year at a time… I won’t pretend to be able to do this in a blog post. But I think this is the right way to evaluate this treatment, and is how medical research should go.
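A toy version of that lifespan simulation, to make the sketch concrete. The utility values are the ones proposed above; every rate (annual heart-attack risk, statin risk reduction, myalgia rates, case fatality) is an invented placeholder, not an estimate from any study:

```python
# Toy lifespan simulation for the utility sketch above. All hazard rates
# and statin effect sizes are invented placeholders for illustration;
# only the utility values come from the sketch in the comment.
import numpy as np

rng = np.random.default_rng(3)

def simulate_lifetime(statin, years=30,
                      p_mi=0.01, statin_rr_mi=0.7,       # assumed annual MI risk and statin RR
                      p_myalgia=0.05, statin_rr_my=2.0,  # assumed myalgia rates
                      p_mi_fatal=0.3):                   # assumed case fatality
    utility, had_mi = 0.0, False
    for _ in range(years):
        myalgia = rng.random() < p_myalgia * (statin_rr_my if statin else 1.0)
        if rng.random() < p_mi * (statin_rr_mi if statin else 1.0):
            utility -= 20.0                  # cost of a heart attack, fatal or not
            if rng.random() < p_mi_fatal:
                return utility               # no benefit after death
            had_mi = True
        if had_mi:
            utility += 0.4 if myalgia else 0.8
        else:
            utility += 0.5 if myalgia else 1.0
    return utility

n = 5_000
treated = np.mean([simulate_lifetime(True) for _ in range(n)])
control = np.mean([simulate_lifetime(False) for _ in range(n)])
print(f"mean utility over {n} simulated lifespans: statin {treated:.1f}, no statin {control:.1f}")
```

The point is not the numbers (which are made up) but the structure: once you commit to utilities and hazards, the treatment decision becomes a comparison of expected utilities rather than a threshold on a p-value.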

As for whether there “Are … many resources suitable for the non-statistician to help him?”

I would say that I hope eventually there will be clinical statisticians who can actually think up and do this kind of Bayesian decision theory calculation, instead of calculating a lot of p-values. A medical researcher should have access to them, but probably as of today doesn’t.

This is done very quickly, but I was thinking along the lines of this sort of thing:

https://argonaut.is.ed.ac.uk/shiny/eharrison/two_proportions/

I am really replying to Ewen’s comment—but the system will not accept replies nested that deep.

I clicked on your link (https://argonaut.is.ed.ac.uk/shiny/eharrison/two_proportions/)—very nice. That is the right way to do such inferences. It is much, much better than traditional NHST. By better I mean (1) more likely to lead to sound inferences about the state of nature and (2) more likely to yield better insight into treatment alternatives.

However, one thing bothers me. In that analysis you use identical priors beta(12, 88) for the control and treated groups. However, there is substantial evidence that statins induce myalgia in some people. Look at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4243897/ and the references cited therein.

So, if one were to consider new data on statins and myalgia and analyze it in the format you presented, it seems to me that the correct priors to use for the control and treated groups would differ.

When I played around with your applet, I felt that priors of beta(6,94) (control group) and beta(12,88) (treated group) were more appropriate. With these priors the absolute risk difference jumps from (Proportion > 0.0 = 92.5%) to (Proportion > 0.0 = 99.2%).
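Those two figures are easy to check with the same conjugate simulation used elsewhere in this thread (Monte Carlo, so expect the last digit to wobble):

```python
# Posterior probability that the absolute risk difference (statin minus
# placebo) is positive, under identical vs differential beta priors.
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

# Identical beta(12,88) priors in both arms:
pt = rng.beta(12 + 19, 88 + 203 - 19, n)   # statin arm, 19/203
pc = rng.beta(12 + 10, 88 + 217 - 10, n)   # placebo arm, 10/217
p_same = np.mean(pt - pc > 0)
print(round(p_same, 3))    # near the 92.5% quoted

# Differential priors: beta(12,88) treated, beta(6,94) control:
pc2 = rng.beta(6 + 10, 94 + 217 - 10, n)
p_diff = np.mean(pt - pc2 > 0)
print(round(p_diff, 3))    # near the 99.2% quoted
```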

At this point I become a little confused. I find it hard to think of a clinical situation in which that difference would cause someone to choose a different alternative. Perhaps we are talking about an 8% difference in the weight one attaches to the loss of QALY due to myalgia vs the addition to QALY from taking statins.

Anyway, thanks for creating the applet and posting it. I found it useful.

Bob

Bob, what you’ve rediscovered is that a 95% threshold is a terrible way to make decisions. And, yes, a good way to make decisions would be to do some kind of expected QALY loss/gain or the like.

Thank you Bob, this has been an interesting discussion. We’re getting into the nitty-gritty of a specific prior choice, which was not really my intention :)

The defaults on the applet were just based on the discussion above, not what I would advocate.

I personally think these priors are too specific. Of course, it depends on the question being asked. As a quick example, a scan of this (https://www.ncbi.nlm.nih.gov/pubmed?term=23444397) suggests “any myopathy” in European patients included in the study on the statin arm was 0.04% per year. (I’m no expert and I have no idea why it is so low.) As you know, the cumulative probability of beta(12,88) even up to 4% is only 0.0006. Yes, the flat prior gives too much weight to unlikely regions of the probability distribution, but this errs in the opposite direction.
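That cumulative probability is a one-line check (using scipy; the value is far below 1% either way):

```python
# How much mass does a beta(12,88) prior put on myalgia rates of 4% or less?
from scipy import stats

print(stats.beta.cdf(0.04, 12, 88))   # on the order of 1e-3 or smaller
```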

My point: is there really a *need* to be so specific? Does the prior distribution, as a regularization device, require this?

On the point of differential priors for control and treatment groups: perhaps I’m still trying to throw off the shackles of a frequentist upbringing and I have too much Popper in me. But I feel an intuitive attraction to having my data fight the null-monster and celebrating the winner.

Ewen: the choice of prior makes a big difference if your decision rule is “did I hit 95%?”, but it probably makes little difference in a Bayesian analysis where you average the utility over the posterior distribution. What is clear is that there’s a risk of myalgia; the real question which no-one is addressing is: so what? Does the statin benefit outweigh the statin risk/side-effect? This depends on the benefit, which is probably not that well established, because again the kind of statistics that is done is NHST comparing one statin against another, or one statin against placebo, in a particular population, under a particular rule for prescribing, etc.
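The “little difference” point can be made concrete without simulation. The posterior mean of the risk difference — the quantity an expected-utility calculation would integrate over — is exact conjugate arithmetic from the trial counts above, and it stays at a positive few percentage points under either prior, even while a 95% threshold verdict can flip:

```python
# Posterior mean risk difference under a flat Beta(1,1) prior vs the
# beta(12,88) prior discussed above. Exact conjugate arithmetic.
def posterior_mean_diff(a, b):
    mt = (a + 19) / (a + b + 203)   # posterior mean, statin arm (19/203)
    mc = (a + 10) / (a + b + 217)   # posterior mean, placebo arm (10/217)
    return mt - mc

flat = posterior_mean_diff(1, 1)
informative = posterior_mean_diff(12, 88)
print(f"flat prior: {flat:.3f}, beta(12,88) prior: {informative:.3f}")
# flat ≈ 0.047, beta(12,88) ≈ 0.033 — same sign, same order of magnitude
```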

“the real question which no-one is addressing is: so what”

Thanks Daniel. Far from it. Decision support has become an integral part of clinical medicine. Many clinicians including myself have published decision models, often delivered in app form to be used in day-to-day practice. There are lots of examples for statins as I’m sure you know e.g. https://www.ncbi.nlm.nih.gov/m/pubmed/23920344/

But with an individual patient sitting in front of you in the clinic, all the same caveats apply to decision analyses as any population-based model applied to an individual.

I’d be interested to be directed to anything helpful you have written or read on the interface between decision modelling and randomised clinical trials. As we sit on the threshold of the era of personalised medicine, the population mean effect of an intervention is largely irrelevant. A global effort to gather sufficient high quality data to inform patient choice is lagging behind our ability to deliver individualised therapy, a situation decision modelling is not currently helping.

Ewen: in most of the stuff I’ve seen (which is, admittedly, not huge quantities) decisions are based on NHST-type considerations and/or sometimes dollar costs. I see your linked article is based on some Bayesian something, but I don’t have access to it, so I can’t see what utility function they use. My biggest concern is that utility functions actually get used, and that they make sense. Here’s an example of the kind of goofiness I usually see:

http://models.street-artists.org/2016/06/13/no-no-no-no/

http://models.street-artists.org/2016/06/14/what-should-the-analysis-of-acupuncture-for-allergic-rhinitis-have-looked-like/

Ewen,

Speaking as a geezer who seems to spend more and more time in doctors’ offices either with a friend or for myself: Decision apps, etc., although well-intended, run the risk of ignoring the patient’s self-knowledge and preferences. For example, what usually works best for me is when the physician lists the options for treatment and leaves the decision up to me.

Thanks Daniel. Links made me smile.

Couldn’t agree more Martha. Most of my job as a clinician is helping patients to come to the decision which is right for them. Something “Dr Algorithm” may struggle with a little.

What Daniel said. The flat-prior inference is a useful statement in that, yes, it is ridiculous, but its ridiculousness can be traced back to a particular scientific assumption—the flat prior—which can be improved.