Comments on: “This finding did not reach statistical significance, but it indicates a 94.6% probability that statins were responsible for the symptoms.”

By: EMM

EMM — Sat, 09 Sep 2017 19:27:54 +0000

In reply to Farrel Buchinsky. You mean "uninterested" don't you?

By: ojm

ojm — Sat, 12 Aug 2017 23:10:58 +0000

I agree that context is important.

The phrase ‘do the residuals look like noise’ is not to be interpreted sans context. It’s shorthand for ‘am I missing anything’. What you expect noise to look like depends on context, judgement etc.

I was just saying that you could translate this judgement into a hypothesis test, and in this context (analysing misfit) it seems to make some sense to me.

Compare misfit to fake, simulated noise and see if you can tell the difference. If you can’t, you have little justification for adopting a more complex (or whatever) model.

By: Martha (Smith)

Martha (Smith) — Sat, 12 Aug 2017 20:05:18 +0000

In reply to Martha (Smith).

I’ve still got a gripe with “Do the residuals look like noise?”

I use the definition, “Data are numbers in context.” So anything that just depends on the numbers (not taking the context into account) is ignoring part of what constitutes the data. Taking context into account, noise could be of various sorts. For example, in talking about brain scans, “noise” could refer to variability between individual brains. (Andrew often uses this meaning for “noise”). It could also refer to measurement inaccuracies (because the measure is not perfect — again, Andrew often talks about this kind of noise), or also to external things that influence the measurement (e.g., literal noise, or vibration, etc. in the case of brain scans.)

By: ojm

ojm — Sat, 12 Aug 2017 03:00:38 +0000

In reply to Martha (Smith).

Martha – sure, I agree. One way to get a feel for possible noise patterns would be to bootstrap the residuals. Or whatever.

Basic point is that a hypothesis test can make sense in this context.

In fact, saying that there are many possible noise patterns is very true and also an argument against adopting any one true model and working within it, and in favour of eg EDA.

By: Daniel Lakeland

Daniel Lakeland — Sat, 12 Aug 2017 02:43:54 +0000

In reply to Martha (Smith). Right, the more explicit statement is "do the residuals look like noise out of this specific RNG noise process?" But this can often be a perfectly reasonable question to ask. The key is to teach people that there's no such thing as generic "noise"; there's an infinity of possible individual kinds of noise.

By: Martha (Smith)

Martha (Smith) — Sat, 12 Aug 2017 02:15:38 +0000

In reply to ojm. "Do the residuals look like noise?" still doesn't answer the question of what "looks like noise" means.

By: ojm

ojm — Sat, 12 Aug 2017 01:00:41 +0000

In reply to ojm. ...to0 (forgot a zero)

By: ojm

ojm — Sat, 12 Aug 2017 00:58:49 +0000

In reply to Andrew.

I see the RNG playing the role of statistical zero: you don’t expect the misfit of an adequate model to be exactly zero but to be indistinguishable from ‘statistical zero’. To model something you need more than just zero – you need the actual model to check to!

By: Ben Prytherch

Ben Prytherch — Sat, 12 Aug 2017 00:47:35 +0000

In reply to Andrew.

Yeah, I thought the same thing. This might be a case of the “I understand easy statistics, just not hard statistics” phenomenon that also leads people who are “just” doing t-tests and one-way ANOVAs to never talk to a statistician (and so never learn about any deeper errors regarding their data / methods). They know better than to try to use a GLMM with a zero-inflated response and correlated errors by themselves, but if it’s just something simple like pressing the t-test button in SPSS or making up an easy to understand example to show what a p-value is…

By: Andrew

Andrew — Sat, 12 Aug 2017 00:09:35 +0000

In reply to Daniel Lakeland. Yes, I've made this point somewhere too. In the conventional view of things, you learn from a rejection (p less than 0.05 or whatever) but you don't learn anything from a non-rejection (as you can never accept the null hypothesis). But I think that, in the overwhelming majority of cases, it's the other way around: rejection is uninteresting as all it tells you is that your data did not come from a specific random number generator that you never believed anyway, whereas non-rejection is informative, as it tells you that this aspect of the data can be explained by a random number generator, which obviously limits what you can possibly say in that case.

By: ojm

ojm — Sat, 12 Aug 2017 00:06:27 +0000

In reply to Daniel Lakeland.

Yeah I agree, failing to reject is usually more interesting. And yes, it doesn’t mean the pattern ‘really’ is from an RNG, since it isn’t (unless you count quantum mechanics or whatever), but that it can be treated as such. You have an adequate model. There may be many others too.

Martha: Draw a straight line through some data. This is not a statistical model, except for the implicit acknowledgement that you don’t have to fit every point (even then this could be thought of as an approximation rather than anything statistical).

Then, to bring in statistics, you could ask ‘do the residuals look like noise?’ and do a formal hypothesis test (if you wanted). Failure to reject can be interpreted as saying your straight line model isn’t ignoring any obvious statistical patterns.
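A minimal sketch of that check in R, under stated assumptions: the data are simulated for illustration, the reference noise is Gaussian with the residuals' standard deviation, and the lag-1 autocorrelation is one arbitrary choice of test statistic among many. It asks only whether the misfit can be told apart from this one specific RNG.

```r
set.seed(1)

# Hypothetical data: a straight line plus Gaussian noise (illustration only).
x <- seq(0, 10, length.out = 50)
y <- 2 + 0.5 * x + rnorm(length(x), sd = 1)

# Draw the straight line and keep the misfit.
fit <- lm(y ~ x)
res <- resid(fit)

# Test statistic (an assumed choice): lag-1 autocorrelation of the residuals.
lag1 <- function(r) cor(r[-1], r[-length(r)])
obs <- lag1(res)

# Reference distribution: the same statistic computed on fake noise from one
# specific RNG, refitting the line each time so fake and real misfit are
# treated the same way.
sims <- replicate(2000, {
  fake <- rnorm(length(x), sd = sd(res))
  lag1(resid(lm(fake ~ x)))
})

# Two-sided Monte Carlo p-value: how often fake noise looks at least this structured.
p_value <- mean(abs(sims) >= abs(obs))
print(p_value)
```

A large p-value here says only that this statistic cannot distinguish the misfit from this noise process; a different statistic or a different RNG could give a different answer.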

By: Andrew

Andrew — Fri, 11 Aug 2017 23:11:21 +0000

In reply to Ben Prytherch. Hey---I write for Slate! They should've asked me before running that.

By: Ben Prytherch

Ben Prytherch — Fri, 11 Aug 2017 22:04:57 +0000

In reply to Ben Prytherch.

Here’s yet another new example, recently featured on Deborah Mayo’s blog:

http://www.slate.com/articles/health_and_science/science/2017/08/how_will_changing_the_p_value_threshold_affect_the_reproducibility_crisis.html

“P-values are tricky business, but here’s the basics on how they work: Let’s say I’m conducting a drug trial, and I want to know if people who take drug A are more likely to go deaf than if they take drug B. I’ll state that my hypothesis is “drugs A and B are equally likely to make someone go deaf,” administer the drugs, and collect the data. The data will show me the number of people who went deaf on drugs A and B, and the p-value will give me an indication of how likely it is that the difference in deafness was due to random chance rather than the drugs. If the p-value is lower than 0.05, it means that the chance this happened randomly is very small—it’s a 5 percent chance of happening, meaning it would only occur 1 out of 20 times if there wasn’t a difference between the drugs. If the threshold is lowered to 0.005 for something to be considered significant, it would mean that the chances of it happening without a meaningful difference between the treatments would be just 1 in 200.”

And this is from someone who’s been researching the topic enough to write an otherwise pretty good article on it. But he still can’t distinguish between p(data|null) and p(null|data) – or he thinks they’re close enough that they can be used interchangeably.

By: Daniel Lakeland

Daniel Lakeland — Fri, 11 Aug 2017 16:22:55 +0000

In reply to ojm. Yes, that's what I meant: rarely is a real RNG at work, it's just that an RNG is adequate. This is why failing to reject is usually more informative. It means you could pretend an RNG is at work without self-contradiction. Note this isn't the same as proving an RNG is at work, which is probably the common misinterpretation when failing to reject.

By: Martha (Smith)

Martha (Smith) — Fri, 11 Aug 2017 16:07:14 +0000

In reply to ojm. Not clear what you're trying to ask. What do you mean by "misfit between a (not necessarily statistical) model and the data" and "looks like noise"? A specific example might help clarify.

By: ojm

ojm — Fri, 11 Aug 2017 11:28:02 +0000

In reply to Daniel Lakeland. What if the question is whether the misfit could be adequately explained by an RNG? That is, whether the misfit between a (not necessarily statistical) model and the data 'looks like noise'. Does that still seem so unreasonable?

By: Ewen

Ewen — Fri, 11 Aug 2017 08:28:31 +0000

In reply to Bob. Thanks Daniel. Links made me smile. Couldn't agree more Martha. Most of my job as a clinician is helping patients to come to the decision which is right for them. Something "Dr Algorithm" may struggle with a little.

By: Martha (Smith)

Martha (Smith) — Fri, 11 Aug 2017 00:25:18 +0000

In reply to Bob. Ewen, Speaking as a geezer who seems to spend more and more time in doctors' offices either with a friend or for myself: Decision apps, etc., although well-intended, run the risk of ignoring the patient's self-knowledge and preferences. For example, what usually works best for me is when the physician lists the options for treatment and leaves the decision up to me.

By: Daniel Lakeland

Daniel Lakeland — Thu, 10 Aug 2017 22:42:21 +0000

In reply to Bob.

Ewen: in most of the stuff I’ve seen (which is, admittedly, not huge quantities) decisions are based on NHST-type considerations and/or sometimes dollar costs. I see your linked article is based on some Bayesian something, but I don’t have access to it so I can’t see what utility function they use. My biggest concern is that utility functions should be used and that they should make actual sense. Here’s an example of the kind of goofiness I usually see:

http://models.street-artists.org/2016/06/13/no-no-no-no/

http://models.street-artists.org/2016/06/14/what-should-the-analysis-of-acupuncture-for-allergic-rhinitis-have-looked-like/

By: Ewen

Ewen — Thu, 10 Aug 2017 22:06:08 +0000

In reply to Bob.

“the real question which no-one is addressing is: so what”

Thanks Daniel. Far from it. Decision support has become an integral part of clinical medicine. Many clinicians including myself have published decision models, often delivered in app form to be used in day-to-day practice. There are lots of examples for statins as I’m sure you know e.g. https://www.ncbi.nlm.nih.gov/m/pubmed/23920344/

But with an individual patient sitting in front of you in the clinic, all the same caveats apply to decision analyses as any population-based model applied to an individual.

I’d be interested to be directed to anything helpful you have written or read on the interface between decision modelling and randomised clinical trials. As we sit on the threshold of the era of personalised medicine, the population mean effect of an intervention is largely irrelevant. A global effort to gather sufficient high quality data to inform patient choice is lagging behind our ability to deliver individualised therapy, a situation decision modelling is not currently helping.

By: Daniel Lakeland

Daniel Lakeland — Thu, 10 Aug 2017 17:34:01 +0000

In reply to Bob. Ewen: the choice of prior makes a big difference if your decision rule is "did I hit 95%" but it probably makes little difference in a Bayesian analysis where you average the utility over the posterior distribution. What is clear is that there's a risk of myalgia; the real question which no-one is addressing is: so what? Does the statin benefit outweigh the statin risk/side-effect? This depends on the benefit, which is probably not that well established because, again, the kind of statistics that is done is NHST comparing one statin against another or one statin against placebo, in a particular population, under a particular rule for prescribing... etc etc

By: Ewen

Ewen — Thu, 10 Aug 2017 17:01:37 +0000

In reply to Bob.

Thank you Bob, this has been an interesting discussion. We’re getting into the nitty-gritty of a specific prior choice, which was not really my intention :)

The defaults on the applet were just based on the discussion above, not what I would advocate.

I personally think these priors are too specific. Of course, it depends on the question being asked. As a quick for instance, a scan of this (https://www.ncbi.nlm.nih.gov/pubmed?term=23444397) suggests “any myopathy” in European patients included in the study on the statin arm was 0.04% per year. (I’m no expert and I have no idea why it is so low). As you know, the cumulative probability of beta(12,88) even up to 4% is only 0.0006. Yes, the flat prior gives too much weight to unlikely regions of the probability distribution, but this errs in the opposite direction.
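The tail mass quoted above can be checked with the beta CDF. This is a small sketch using base R's `pbeta`. The 5% and 20% cut-offs are taken from Bob's description of his prior elsewhere in the thread, and the flat beta(1,1) line is included only for contrast.

```r
# Prior mass that beta(12, 88) places at or below a 4% event rate
p_low <- pbeta(0.04, 12, 88)

# Mass below 5% and above 20%, the rough bounds Bob describes for his prior
p_below_5  <- pbeta(0.05, 12, 88)
p_above_20 <- pbeta(0.20, 12, 88, lower.tail = FALSE)

# For contrast, the flat prior beta(1, 1) puts exactly 4% of its mass at or below 4%
p_low_flat <- pbeta(0.04, 1, 1)

print(signif(c(p_low = p_low, p_below_5 = p_below_5,
               p_above_20 = p_above_20, p_low_flat = p_low_flat), 3))
```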

My point: is there really a *need* to be so specific? Does the prior distribution as a regularization device require this?

On the point of differential priors for control and treatment groups. Perhaps I’m still trying to throw off the shackles of a frequentist up-bringing and I have too much Popper in me. But I feel an intuitive attraction in having my data fight the null-monster and celebrating the winner.

By: Daniel Lakeland

Daniel Lakeland — Thu, 10 Aug 2017 13:51:39 +0000

In reply to Bob. Bob, what you've rediscovered is that a 95% threshold is a terrible way to make decisions. And, yes, a good way to make decisions would be to do some kind of expected QALY loss/gain or the like.

By: Bob

Bob — Thu, 10 Aug 2017 12:56:15 +0000

In reply to Bob.

I am really replying to Ewen’s comment—but the system will not accept replies nested that deep.

I clicked on your link (https://argonaut.is.ed.ac.uk/shiny/eharrison/two_proportions/)—very nice. That is the right way to do such inferences. It is much, much better than traditional NHST. By better I mean (1) likely to lead to sound inferences about the state of nature and (2) more likely to yield better insight into treatment alternatives.

However, one thing bothers me. In that analysis you use identical priors beta(12, 88) for the control and treated groups. However, there is substantial evidence that statins induce myalgia in some people. Look at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4243897/ and the references cited therein.

So, if one were to consider new data on statins and myalgia and analyze it in the format you presented, it seems to me that the correct priors to use for the control and treated groups would differ.

When I played around with your applet, I felt that priors of beta(6,94) (control group) and beta(12, 88) (treated group) were more appropriate. With these priors the absolute risk difference jumps from (Proportion > 0.0 = 92.5%) to (Proportion > 0.0 = 99.2%).

At this point I become a little confused. I find it hard to think of a clinical situation in which that difference would cause someone to choose a different alternative. Perhaps, we are talking about an 8% difference in the weight one attaches to the loss of QALY due to myalgia vs the addition to QALY from taking statins.

Anyway, thanks for creating the applet and posting it. I found it useful.

Bob

By: Anoneuoid

Anoneuoid — Wed, 09 Aug 2017 14:52:10 +0000

In reply to Carlos Ungil.

B) The quality control department in a dice factory wants to verify that there are no manufacturing issues. One of the tests is to check whether high numbers appear too often (the faces with low numbers are slightly heavier).

Do they actually do this? I found some info where they do not describe such testing, instead they rely on careful control of the materials and manufacturing process:

To ensure that each die produced meets specified quality standards, a number of quality control measures are taken. Prior to manufacturing, certain physical and chemical properties of the incoming plastic raw materials are checked. This includes things such as molecular weight determinations, chemical composition studies, and visual inspection of the appearance. More rigorous testing may also be done. For example, stress-strain testing can be performed to determine the strength of the plastic. Impact tests help determine the toughness of the plastic. During manufacture, line inspectors are stationed at various points on the production line. Here, they visually check the plastic parts to make sure they are shaped, sized and colored correctly. They also check the integrity of the final packaging. If any defective dice are found, they are removed from the production line and set aside for reforming. Computers are also used to control plastic use, mold retention time, and line speed.

http://www.madehow.com/Volume-4/Dice.html

Another source missing a description of NHST:
http://www.midwestgamesupply.com/dice_manufacturing.htm

Here, dice from two manufacturers are tested and it is found that *neither* passes the “randomness test” (and probably none exist that do):
http://www.awesomedice.com/blog/353/d20-dice-randomness-test-chessex-vs-gamescience/

The great thing at that last link is that they come up with a theory to explain the distribution of rolls for one of the dice (not just why there should be a significant p-value), and… do a direct replication study! That is so much better science than most of what is getting published these days.

tl;dr: NHST is not even useful for checking dice.

By: Ewen

Ewen — Wed, 09 Aug 2017 14:21:11 +0000

In reply to Daniel Lakeland.

This is done very quickly, but I was thinking along the lines of this sort of thing:
https://argonaut.is.ed.ac.uk/shiny/eharrison/two_proportions/

By: Martha (Smith)

Martha (Smith) — Tue, 08 Aug 2017 23:47:44 +0000

In reply to Martha (Smith).

Carlos said,

“I agree that people have an amazing capability to misinterpret p-values. But that’s mostly a problem with people, not a problem with p-values :-)”

But if we are talking about teaching, we are talking about people. So giving an explanation is not the same as people understanding it well enough to use it correctly — and the latter is the goal.

By: Carlos Ungil

Carlos Ungil — Tue, 08 Aug 2017 23:35:46 +0000

In reply to Martha (Smith).

Of course one cannot make miracles in 15 minutes. But I think one has to use a simple example, and not just because describing a complex example takes longer. A real-world example gives too many opportunities to get lost in the details.

A toy problem is enough to point at the obvious issues with p-values, like depending only on the null hypothesis (and not on the hypothesis of interest) and not including any information about how likely the null hypothesis is to be true ex ante (let alone the alternative).

One would intuitively expect the probability of a loaded die given the data in the examples above to have the ordering A > B > C (cheating suspected > routine check > telekinesis impossible). The p-value is the same, though. “Rejecting the null” equally in the three cases doesn’t make much sense, and for the telekinesis example even if we “reject the null” we shouldn’t accept the alternative at face value.
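To make the ordering concrete, here is a rough sketch in R using Bayes' rule with a single point alternative. Everything numeric beyond the 7-of-8 result is an assumption for illustration: the loaded die is taken to show a high number 75% of the time, and the three prior probabilities are invented stand-ins for "cheating suspected", "routine check" and "telekinesis".

```r
# Likelihood of 7 high numbers in 8 tosses under each hypothesis
lik_fair   <- dbinom(7, 8, 0.5)
lik_loaded <- dbinom(7, 8, 0.75)   # assumed: a loaded die shows high numbers 75% of the time

# Assumed prior probabilities of the alternative in scenarios A, B and C
prior <- c(A_cheating = 0.5, B_routine_check = 0.01, C_telekinesis = 1e-6)

# Bayes' rule: P(loaded | data) for each prior; the data and p-value are identical
posterior <- prior * lik_loaded / (prior * lik_loaded + (1 - prior) * lik_fair)
print(signif(posterior, 3))
```

The same data move each prior by the same likelihood ratio, so the posteriors keep the A > B > C ordering even though the p-value cannot tell the three situations apart.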

I agree that people have an amazing capability to misinterpret p-values. But that’s mostly a problem with people, not a problem with p-values :-)

The

By: Martha (Smith)

Martha (Smith) — Tue, 08 Aug 2017 22:59:59 +0000

In reply to Carlos Ungil.

Carlos,

Whether or not probability is taught in high school depends on the country (and in the U.S., on the state or even the individual school district).

Your examples are nice ones, but need to be followed up by (and connected to) examples (such as those in most scientific research) which are not as straightforward as your examples.

By: Daniel Lakeland

Daniel Lakeland — Tue, 08 Aug 2017 22:54:03 +0000

In reply to Daniel Lakeland.

Right, but explaining that to people… that’s the hard part, because we’re talking about overcoming what seems to be a built-in misinterpretation. People are natural Bayesians: they want the p value to mean “the probability that the null hypothesis is true” and they want 1-p to mean “the probability that my favorite hypothesis is true” ;-)

By: Carlos Ungil

Carlos Ungil — Tue, 08 Aug 2017 22:47:33 +0000

In reply to Daniel Lakeland. It doesn't matter, because the p-value doesn't have anything to do with loaded dice! :-)

By: Daniel Lakeland

Daniel Lakeland — Tue, 08 Aug 2017 22:41:58 +0000

In reply to Carlos Ungil.

You need to have a discussion about “what does it mean… the probability that *this* die is loaded?”

From a frequency perspective, it’s simple to say “how often would we get a result more unusual than this if the die is totally fair and we repeat the experiment very very many times” but does this mean something close to “what is the (1-probability) that *this* die is loaded?”

driving home that distinction… not so easy

By: Carlos Ungil

Carlos Ungil — Tue, 08 Aug 2017 22:33:52 +0000

In reply to Martha (Smith).

Don’t they learn probability in high school? Probably they understand at least that a die has six faces and if I toss the die the probability of each outcome 1,2,3,4,5,6 is 1/6.

A) I suspect someone is cheating and using a loaded die which shows a high number (4, 5, or 6) more often than it should (the students will probably agree that the probability of this event is 1/2).

To check whether the die is loaded, I define the null “it is not loaded and the probability of getting a high number is 1/2” and I toss the die 8 times. I get high numbers on 7 occasions and a low number only once.

The p-value is 0.035 < 0.05 (the probability of getting 7 or 8 high numbers out of 8, assuming the die is fair).
What is the probability that the die is loaded, given our data? We don't know!

B) The quality control department in a dice factory wants to verify that there are no manufacturing issues. One of the tests is to check whether high numbers appear too often (the faces with low numbers are slightly heavier).

The null hypothesis is, as before, that the probability is 1/2, I toss the die 8 times and the result is 7 high numbers and 1 low number.

The p-value is 0.035 < 0.05 (the probability of getting 7 or 8 high numbers out of 8, assuming the die is fair).
What is the probability that the die we are checking is biased, given our data? We don't know!

C) Uri Geller claims telekinetic powers and we perform an experiment to see if it is true. We will throw a die (perfectly fair as far as we know) 8 times, and record how often he "forces" a high number (4, 5, or 6).

The null hypothesis is, as before, that the probability is 1/2, I toss the die 8 times and the result is 7 high numbers and 1 low number.

The p-value is 0.035 < 0.05 (the probability of getting 7 or 8 high numbers out of 8, assuming the die is fair).
What is the probability that Uri Geller has telekinetic powers, given our data? We don't know!
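The p-value shared by the three scenarios can be reproduced in base R. The sketch below computes the probability of 7 or 8 high numbers in 8 tosses of a fair die in three equivalent ways; the variable names are illustrative only.

```r
# P(7 or 8 high numbers in 8 tosses | fair die), three equivalent ways
p_sum   <- sum(dbinom(7:8, size = 8, prob = 0.5))                 # add the two outcomes
p_tail  <- pbinom(6, size = 8, prob = 0.5, lower.tail = FALSE)    # upper tail P(X > 6)
p_exact <- binom.test(7, 8, p = 0.5, alternative = "greater")$p.value

print(c(p_sum, p_tail, p_exact))   # each is 9/256, about 0.035
```

Nothing in this calculation refers to cheating, quality control or telekinesis, which is why it returns the same number in all three cases.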

By: Martha (Smith)

Martha (Smith) — Tue, 08 Aug 2017 22:33:48 +0000

In reply to Kyle C. Kyle: Still not definitive, but adding some things you omitted: "the probability of seeing a result that is at least this far from the null (as measured by the same variable, with a sample of the same size as used in the study), provided the statistical model of the phenomenon is correct and the null hypothesis is true (bearing in mind that the latter two conditions are rarely if ever met)."

By: Anoneuoid

Anoneuoid — Tue, 08 Aug 2017 21:37:44 +0000

In reply to Bob Calin-Jageman. So do you teach that the "true" parameter has a 95% probability of being inside the 95% confidence interval, or no?

By: Ben Prytherch

Ben Prytherch — Tue, 08 Aug 2017 21:10:27 +0000

In reply to Kyle C. +1 Jason. The better we explain this, the more our students will understandably wonder why everyone is using it as the principal means of establishing "statistical evidence".

By: Ben Prytherch

Ben Prytherch — Tue, 08 Aug 2017 21:09:11 +0000

In reply to Kyle C. Haha, yes. I could do that! I should probably also take on: "Despite the fact that everything about the way you see this method being used in practice will point you toward believing the inverse probability definition that you find so intuitive, you must resist this. It doesn't matter that the whole act of using an estimated probability to declare things significant or not significant would make far more sense were your intuition correct. Your intuition is not correct. If this confuses you, embrace the confusion."

By: Jason Yamada-Hanff

Jason Yamada-Hanff — Tue, 08 Aug 2017 20:34:20 +0000

In reply to Kyle C. Well, of course, then the obvious question is "if that's the only valid meaning, why do I care at all?" which leads back to Ben's option (3).

By: Kyle C

Kyle C — Tue, 08 Aug 2017 20:18:31 +0000

In reply to Ben Prytherch.

Thanks, Ben. Can’t you tell your students, “This is what it means,* no more, no less, there is no correct paraphrase, if anyone says anything different, they are wrong. If you give the inverse probability or ‘due to chance’ definition in this class, you will fail.”?

* I may butcher the definition myself, but something like, the probability of seeing a result that is this far from the null, if and only if our statistical model of the phenomenon were precisely correct and the null were precisely true, neither of which we ever believe.

By: Bob Calin-Jageman

Bob Calin-Jageman — Tue, 08 Aug 2017 15:52:05 +0000

In reply to Anoneuoid.

First, most people have an intuitive sense of what a confidence interval is, which is essentially “a range of plausible/credible values for the truth”.

That intuitive conception is *not* strictly correct, and many other commonly-adopted interpretations of frequentist-based confidence intervals are also technically incorrect (as extensively documented, for example, by Richard Morey). But I would argue that in general the misconceptions related to frequentist CIs are relatively benign compared to those related to p values. For someone interpreting data on a question about which they have relatively little prior information, I don’t think they can go too far wrong with thoughtfully interpreting a CI (assuming, of course, the measurements are reasonable, etc.). If you have examples where that’s not the case, I’d love to learn about them.

For those who want to dig a bit deeper into where CIs come from, simulations can be really helpful (like this one: http://rpsychologist.com/new-d3-js-visualization-interpreting-confidence-intervals )

Finally, credible intervals are great. If someone wants to start there or, learn both approaches, that’s awesome. My point is that for those who think they want to know about p values, understanding them in terms of frequentist CIs can a) give them a much better sense of what ‘statistically significant’ means, and b) give them the tools to convert p values to CIs to hopefully be more thoughtful and less dichotomous in their thinking about the data obtained.
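As a small illustration of the kind of simulation mentioned above, the following R sketch repeats a hypothetical study many times and counts how often the 95% t-interval contains the true mean. The true mean, standard deviation, sample size and number of repetitions are all assumed values chosen for the example.

```r
set.seed(42)

true_mean <- 10      # assumed "truth" for the simulation
n         <- 20      # assumed sample size per study
n_studies <- 5000    # assumed number of repeated studies

covered <- replicate(n_studies, {
  sample_data <- rnorm(n, mean = true_mean, sd = 3)
  ci <- t.test(sample_data)$conf.int      # t.test reports a 95% interval by default
  ci[1] <= true_mean && true_mean <= ci[2]
})

# Long-run fraction of intervals that contain the true mean
coverage <- mean(covered)
print(coverage)
```

The 95% describes the procedure over repeated studies; it is not a probability statement about any single computed interval.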

By: Bob Calin-Jageman

Bob Calin-Jageman — Tue, 08 Aug 2017 15:36:21 +0000

In reply to Martha (Smith). Yep, agreed. But the question was how to explain p values, so of course hurtful dichotomous thinking was going to come up. I'd rather not teach them at all, not only to avoid dichotomous thinking but also because the comment from Ben is pretty spot on: it's mostly a fool's errand; most students just won't be able to resist gravitating towards erroneous conceptions of what p means.

By: Steven Vnnoy

Steven Vnnoy — Tue, 08 Aug 2017 15:16:49 +0000

In reply to Thanatos Savehn. Before I read the comments, I too went to find the original article and was at first confused by the discrepancy. Interestingly, there is a digital artifact. If you paste in the quote as Jackson has it in his comment above, Google Scholar finds it verbatim. But then when you retrieve the actual article the offending clause is gone. There is a statement of correction as follows: "This Viewpoint was corrected December 12, 2016, to change language regarding the STOMP trial" which is pretty disingenuous. I'm just contributing this comment to highlight the concern that it is hard to completely correct mistakes in printed materials these days, making even understandable errors substantially problematic.

By: Daniel Lakeland

Daniel Lakeland — Tue, 08 Aug 2017 14:32:55 +0000

In reply to Bob.

Bob, just by eye, your prior matches something like beta(12,87) or similar. Under that prior, my simulation

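# posterior draws: 19 events out of 203 (treated), 10 out of 217 (control)
pt = rbeta(1000,12+19,87+203-19)
pc = rbeta(1000,12+10,87+217-10)

# fraction of draws in which the relative risk exceeds 1
sum(pt/pc > 1)/1000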

gives 93% probability that the relative risk is over 1

If you imagine that the control group here was representative of many control groups for myalgia which you could have looked up info on, you could use something like beta(1,21) as the prior for both groups (diffuse relative to the results we have here on the control)

pt = rbeta(1000,1+19,21+203-19)
pc = rbeta(1000,1+10,21+217-10)

sum(pt/pc > 1)/1000

this gives 96% chance of relative risk over 1
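The two runs above can be folded into one function and repeated for the other priors mentioned in this thread. This is only a sketch: the helper name `prob_rr_over_1` is made up for illustration, the counts (19 of 203 treated, 10 of 217 control) are read off the `rbeta` calls above, and the number of draws is raised to reduce Monte Carlo noise.

```r
set.seed(2017)
n_draws <- 100000

# Posterior probability that relative risk > 1 under a shared beta(a, b) prior.
# Counts are taken from the simulation above: 19 of 203 (treated), 10 of 217 (control).
prob_rr_over_1 <- function(a, b) {
  pt <- rbeta(n_draws, a + 19, b + 203 - 19)
  pc <- rbeta(n_draws, a + 10, b + 217 - 10)
  mean(pt / pc > 1)
}

# Priors that come up in the discussion
priors <- list(flat = c(1, 1), beta_2_8 = c(2, 8),
               beta_12_87 = c(12, 87), beta_1_21 = c(1, 21))

results <- sapply(priors, function(p) prob_rr_over_1(p[1], p[2]))
print(round(results, 3))
```

The spread across priors shows how a fixed 95% cut-off can flip on a modelling choice, even when the practical conclusion barely changes.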

However, perhaps a more important question is how much total quality of life benefit or cost the treatment has. Let’s presume the benefit of a year without myalgia is 1, the benefit of a year with myalgia is say .5, a heart attack from which you survive has cost say -20, and reduces your benefit afterwards to .8 and .4 for without and with myalgia, and a heart attack that kills you before age 65 is -20 and there is no benefit after death… (here you can think of these benefits as ratios of two quantities with the same units, like dollars/dollars). Then we need some model for how long you live under treatment vs control (the assumption is statins reduce your heart attack risks etc… this is just a sketch)

Finally, we could come up with something that we actually care about: the benefit/cost to the patient, of giving the treatment, by simulating a few thousand lifespans, one year at a time… I won’t pretend to be able to do this in a blog post. But I think this is the right way to evaluate this treatment, and is how medical research should go.

As for whether there “Are … many resources suitable for the non-statistician to help him?”

I would say, that I hope eventually there will be clinical statisticians who can actually think up and do this kind of Bayesian decision theory calculation, instead of calculating a lot of p values. A medical researcher should have access to them, but probably as of today doesn’t.

By: Anoneuoid

Anoneuoid — Tue, 08 Aug 2017 13:43:45 +0000

In reply to Bob.

If someone did an experiment with N=200 in both the group treated with statins and the control group and 10 people in each group reported myalgia, it would not change my beliefs about statins and myalgia very much.

Surely your beliefs about statins and myalgia are limited to referring to some "frame" or "population", though. No one is able to predict why some people get myalgia vs not, right? So maybe such a study just found a good subgroup or something changed (some widespread dietary component, eg gluten, became unpopular). I don't think that is any reason to ignore studies like that, especially when we are largely ignorant about why anything is going on in the system.

By: Bob

Bob — Tue, 08 Aug 2017 12:57:19 +0000

In reply to Daniel Lakeland.

I don’t know what the right prior is. But, there is a substantial literature on myalgia and statins. See, for example, http://www.bmj.com/content/345/bmj.e5348.

That 2012 note advises a practitioner dealing with a patient with uncomfortable muscle aches in the arms to “Explain that myalgia with statins is common, affecting 5-10% of patients in clinical trials of statins.”

A flat prior regarding the extent to which treatment with statins induces myalgia is inappropriate, given all the evidence and experience that are available today. If someone did an experiment with N=200 in both the group treated with statins and the control group and 10 people in each group reported myalgia, it would not change my beliefs about statins and myalgia very much.

It seems to me that the right prior regarding the effects of statins has a big lump of probability near 10% with relatively little probability below 5% or above 20%.

Bob

By: Ewen

Ewen — Tue, 08 Aug 2017 10:34:05 +0000

In reply to Daniel Lakeland.

Or beta(4,16) or maybe beta(1,10). And let’s reflect the greater prior probability of myalgia in the statin group etc. etc.

The clinician here has an intuition that the probability of the outcome of interest is greater in treatment compared with control group, despite the NHST being above a given threshold. But he doesn’t know how to express it.

Are there many resources suitable for the non-statistician to help him?

By: Andrew

Andrew — Tue, 08 Aug 2017 01:02:51 +0000

In reply to Ewen.

What Daniel said. The flat-prior inference is a useful statement in that, yes, it is ridiculous, but its ridiculousness can be traced back to a particular scientific assumption—the flat prior—which can be improved.

By: Daniel Lakeland

Daniel Lakeland — Tue, 08 Aug 2017 00:59:32 +0000

In reply to Ewen. If they'd done something like a beta(2,8) prior, suggesting that maybe around 20% of people have myalgia they could have given a probability that relative risk was greater than 1 of 96.2%, which *is* above the conventional significance threshold (not that I care about this). I'm not sure it would be uncontroversial, but it'd at least be principled and not based on inappropriately broad priors.

By: Ben Prytherch

Ben Prytherch — Mon, 07 Aug 2017 20:44:17 +0000

In reply to Martha (Smith).

Yes, I agree. One might be able to pull off one of these things for an audience that was already familiar with p-values, but it would be rushed.

Maybe then the best approach would be to openly state that this is a subject that requires far more time than is available, but to give you a sense of how fraught with danger it is, here are some quotes from Gelman, Cohen, Meehl, and the ASA…

And then end with the classic “if you think you understand quantum mechanics, you don’t understand quantum mechanics”, with “quantum mechanics” crossed out and “p-values” written in.

That would be a pretty nihilistic take on “interpreting statistical evidence”, but perhaps appropriate given the time constraint.

I would NOT give an uncritical “here’s what p-values are in 15 minutes” talk. Everyone would come away either feeling confused, or even worse feeling like they understood it.

By: Farrel Buchinsky

Farrel Buchinsky — Mon, 07 Aug 2017 20:40:21 +0000

In reply to Z.

The problem is incredibly widespread in medicine. Most practitioners and academics fall into the misconceptions that are oh so tempting. Only two types avoid the temptation.
1) The extraordinarily numerate (very few people)
2) Those completely disinterested in data (the storytelling folk), but even they fall victim to the stories being told by those who succumbed to temptation.
Gerd Gigerenzer describes a SIC Syndrome amongst doctors (Self defence, Innumeracy, Conflict of interests). The sad thing is that so often in medicine there is no data to illuminate a particular question and then when there is data, it simply becomes raw material for the SIC.
