The phrase ‘do the residuals look like noise’ is not to be interpreted sans context. It’s shorthand for ‘am I missing anything’. What you expect noise to look like depends on context, judgement etc.

I was just saying that you could translate this judgement into a hypothesis test, and in this context (analysing misfit) it seems to make some sense to me.

Compare misfit to fake, simulated noise and see if you can tell the difference. If you can’t, you have little justification for adopting a more complex (or whatever) model.

I use the definition, “Data are numbers in context.” So anything that just depends on the numbers (not taking the context into account) is ignoring part of what constitutes the data. Taking context into account, noise could be of various sorts. For example, in talking about brain scans, “noise” could refer to variability between individual brains. (Andrew often uses this meaning for “noise”.) It could also refer to measurement inaccuracies (because the measure is not perfect; again, Andrew often talks about this kind of noise), or to external things that influence the measurement (e.g., literal noise or vibration in the case of brain scans).

The basic point is that a hypothesis test can make sense in this context.

In fact, it is very true that there are many possible noise patterns, and this is also an argument against adopting any one true model and working within it, and in favour of e.g. EDA.

Martha: Draw a straight line through some data. This is not a statistical model, except for the implicit acknowledgement that you don’t have to fit every point (even then this could be thought of as an approximation rather than anything statistical).

Then, to bring in statistics, you could ask ‘do the residuals look like noise?’ and do a formal hypothesis test (if you wanted). Failure to reject can be interpreted as saying your straight line model isn’t ignoring any obvious statistical patterns.
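To make the “do the residuals look like noise?” check concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the function names, the toy data, the choice of lag-1 autocorrelation as the “pattern” statistic, and Gaussian fake noise with the same spread as the real residuals. The points are assumed to be ordered by x. It fits a straight line, then asks how often fake-noise data sets give residuals that look at least as patterned as the real ones.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def residuals(xs, ys):
    """Misfit of the straight-line fit at each point."""
    slope, intercept = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

def lag1_autocorr(res):
    """Lag-1 autocorrelation: large in magnitude if a smooth pattern remains."""
    m = statistics.fmean(res)
    den = sum((r - m) ** 2 for r in res)
    num = sum((a - m) * (b - m) for a, b in zip(res, res[1:]))
    return num / den

def noise_test(xs, ys, n_sim=2000, seed=0):
    """Share of fake-noise data sets whose residuals look at least as patterned."""
    rng = random.Random(seed)
    res = residuals(xs, ys)
    observed = abs(lag1_autocorr(res))
    sd = statistics.stdev(res)
    slope, intercept = fit_line(xs, ys)
    hits = 0
    for _ in range(n_sim):
        # Fake data: the fitted line plus simulated Gaussian noise, then refit.
        fake = [intercept + slope * x + rng.gauss(0, sd) for x in xs]
        if abs(lag1_autocorr(residuals(xs, fake))) >= observed:
            hits += 1
    return hits / n_sim

# Toy data (hypothetical): one truly straight set, one with a missed curve.
xs = [i / 10 for i in range(40)]
rng = random.Random(1)
straight = [1 + 2 * x + rng.gauss(0, 0.3) for x in xs]
curved = [1 + 2 * x + 0.5 * x ** 2 + rng.gauss(0, 0.3) for x in xs]
p_straight = noise_test(xs, straight)
p_curved = noise_test(xs, curved)
```

A small value for the curved data says the straight line is ignoring an obvious pattern. A large value does not prove the residuals came from a random number generator, only that this particular statistic cannot tell them apart from one.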

“P-values are tricky business, but here’s the basics on how they work: Let’s say I’m conducting a drug trial, and I want to know if people who take drug A are more likely to go deaf than if they take drug B. I’ll state that my hypothesis is “drugs A and B are equally likely to make someone go deaf,” administer the drugs, and collect the data. The data will show me the number of people who went deaf on drugs A and B, and the p-value will give me an indication of how likely it is that the difference in deafness was due to random chance rather than the drugs. If the p-value is lower than 0.05, it means that the chance this happened randomly is very small—it’s a 5 percent chance of happening, meaning it would only occur 1 out of 20 times if there wasn’t a difference between the drugs. If the threshold is lowered to 0.005 for something to be considered significant, it would mean that the chances of it happening without a meaningful difference between the treatments would be just 1 in 200.”

And this is from someone who’s been researching the topic enough to write an otherwise pretty good article on it. But he still can’t distinguish between p(data|null) and p(null|data) – or he thinks they’re close enough that they can be used interchangeably.

Note this isn’t the same as proving an RNG is at work, which is probably the common misinterpretation when failing to reject.

Couldn’t agree more, Martha. Most of my job as a clinician is helping patients to come to the decision which is right for them. Something “Dr Algorithm” may struggle with a little.

Speaking as a geezer who seems to spend more and more time in doctors’ offices either with a friend or for myself: Decision apps, etc., although well-intended, run the risk of ignoring the patient’s self-knowledge and preferences. For example, what usually works best for me is when the physician lists the options for treatment and leaves the decision up to me.

Thanks Daniel. Far from it. Decision support has become an integral part of clinical medicine. Many clinicians including myself have published decision models, often delivered in app form to be used in day-to-day practice. There are lots of examples for statins as I’m sure you know, e.g. https://www.ncbi.nlm.nih.gov/m/pubmed/23920344/

But with an individual patient sitting in front of you in the clinic, all the same caveats apply to decision analyses as any population-based model applied to an individual.

I’d be interested to be directed to anything helpful you have written or read on the interface between decision modelling and randomised clinical trials. As we sit on the threshold of the era of personalised medicine, the population mean effect of an intervention is largely irrelevant. A global effort to gather sufficient high quality data to inform patient choice is lagging behind our ability to deliver individualised therapy, a situation decision modelling is not currently helping.

The defaults on the applet were just based on the discussion above, not what I would advocate.

I personally think these priors are too specific. Of course, it depends on the question being asked. As a quick for instance, a scan of this (https://www.ncbi.nlm.nih.gov/pubmed?term=23444397) suggests “any myopathy” in European patients included in the study on the statin arm was 0.04% per year. (I’m no expert and I have no idea why it is so low). As you know, the cumulative probability of beta(12,88) even up to 4% is only 0.0006. Yes, the flat prior gives too much weight to unlikely regions of the probability distribution, but this errs in the opposite direction.
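For anyone who wants to check that tail figure without special software, here is a small sketch. The function name is made up for illustration, and it only works for whole-number beta parameters, because it relies on the identity P(Beta(a, b) <= x) = P(Binomial(a + b - 1, x) >= a).

```python
from math import comb

def beta_cdf_int(x, a, b):
    """P(Beta(a, b) <= x) for positive integer a and b.

    Uses the identity with the upper tail of a Binomial(a + b - 1, x).
    """
    n = a + b - 1
    return sum(comb(n, k) * x**k * (1 - x) ** (n - k) for k in range(a, n + 1))

# Prior mass that beta(12, 88) puts on rates of 4% or lower.
mass_informative = beta_cdf_int(0.04, 12, 88)

# The same quantity under a flat beta(1, 1) prior, for comparison.
mass_flat = beta_cdf_int(0.04, 1, 1)

print(f"beta(12,88): {mass_informative:.4f}   flat: {mass_flat:.4f}")
```

The comparison shows the trade-off in the comment above: the flat prior spreads its mass evenly, while beta(12,88) all but rules out rates in the range the cited study reports.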

My point: is there really a *need* to be so specific? Does the prior distribution as a regularization device require this?

On the point of differential priors for control and treatment groups: perhaps I’m still trying to throw off the shackles of a frequentist upbringing and I have too much Popper in me. But I feel an intuitive attraction in having my data fight the null-monster and celebrating the winner.

I clicked on your link (https://argonaut.is.ed.ac.uk/shiny/eharrison/two_proportions/). Very nice. That is the right way to do such inferences. It is much, much better than traditional NHST. By better I mean (1) likely to lead to sound inferences about the state of nature and (2) more likely to yield better insight into treatment alternatives.

However, one thing bothers me. In that analysis you use identical priors beta(12, 88) for the control and treated groups. However, there is substantial evidence that statins induce myalgia in some people. Look at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4243897/ and the references cited therein.

So, if one were to consider new data on statins and myalgia and analyze it in the format you presented, it seems to me that the correct priors to use for the control and treated groups would differ.

When I played around with your applet, I felt that priors of beta(6,94) (control group) and beta(12, 88) (treated group) were more appropriate. With these priors the posterior probability that the absolute risk difference is positive jumps from (Proportion > 0.0 = 92.5%) to (Proportion > 0.0 = 99.2%).

At this point I become a little confused. I find it hard to think of a clinical situation in which that difference would cause someone to choose a different alternative. Perhaps we are talking about an 8% difference in the weight one attaches to the loss of QALY due to myalgia vs the addition to QALY from taking statins.

Anyway, thanks for creating the applet and posting it. I found it useful.

Bob

B) The quality control department in a dice factory wants to verify that there are no manufacturing issues. One of the tests is to check whether high numbers appear too often (the faces with low numbers are slightly heavier).

Do they actually do this? I found some info where they do not describe such testing, instead they rely on careful control of the materials and manufacturing process:

To ensure that each die produced meets specified quality standards, a number of quality control measures are taken. Prior to manufacturing, certain physical and chemical properties of the incoming plastic raw materials are checked. This includes things such as molecular weight determinations, chemical composition studies, and visual inspection of the appearance. More rigorous testing may also be done. For example, stress-strain testing can be performed to determine the strength of the plastic. Impact tests help determine the toughness of the plastic. During manufacture, line inspectors are stationed at various points on the production line. Here, they visually check the plastic parts to make sure they are shaped, sized and colored correctly. They also check the integrity of the final packaging. If any defective dice are found, they are removed from the production line and set aside for reforming. Computers are also used to control plastic use, mold retention time, and line speed.

http://www.madehow.com/Volume-4/Dice.html

Another source missing a description of NHST:

http://www.midwestgamesupply.com/dice_manufacturing.htm

Here, dice from two manufacturers are tested and it is found that *neither* passes the “randomness test” (and probably none exist that do):

http://www.awesomedice.com/blog/353/d20-dice-randomness-test-chessex-vs-gamescience/

The great thing at that last link is that they come up with a theory to explain the distribution of rolls for one of the dice (not just why there should be a significant p-value), and… do a direct replication study! That is so much better science than most of what is getting published these days.

tl;dr: NHST is not even useful for checking dice.

https://argonaut.is.ed.ac.uk/shiny/eharrison/two_proportions/

“I agree that people have an amazing capability to misinterpret p-values. But that’s mostly a problem with people, not a problem with p-values :-)”

But if we are talking about teaching, we are talking about people. So giving an explanation is not the same as people understanding it well enough to use it correctly — and the latter is the goal.

A toy problem is enough to point at the obvious issues with p-values, like depending only on the null hypothesis (and not on the hypothesis of interest) and not including any information about how likely the null hypothesis is to be true ex ante (let alone the alternative).

One would intuitively expect the probability of a loaded die given the data in the examples above to have the ordering A > B > C (cheating suspected > routine check > telekinesis impossible). The p-value is the same, though. “Rejecting the null” equally in the three cases doesn’t make much sense, and for the telekinesis example even if we “reject the null” we shouldn’t accept the alternative at face value.
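A rough sketch of how that ordering falls out once priors enter the picture. The prior probabilities and the assumed behaviour of a loaded die (high number with probability 0.75) are invented for illustration and are not part of the examples; the data (7 high numbers in 8 tosses) are the same in all three cases.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def posterior_loaded(prior_loaded, k=7, n=8, p_fair=0.5, p_loaded=0.75):
    """P(loaded | data) by Bayes' rule with two simple hypotheses.

    p_loaded = 0.75 is a hypothetical stand-in for what "loaded" means.
    """
    like_fair = binom_pmf(k, n, p_fair)
    like_loaded = binom_pmf(k, n, p_loaded)
    num = prior_loaded * like_loaded
    return num / (num + (1 - prior_loaded) * like_fair)

# Hypothetical prior probabilities that the alternative is true.
priors = {
    "A: cheating suspected": 0.5,
    "B: routine check": 0.01,
    "C: telekinesis": 1e-6,
}
posteriors = {name: posterior_loaded(p) for name, p in priors.items()}

for name, post in posteriors.items():
    print(f"{name}: P(loaded | 7 of 8 high) = {post:.6f}")
```

Same data and same p-value in every case, but the posterior probabilities differ by orders of magnitude, which is exactly the information the p-value leaves out.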

I agree that people have an amazing capability to misinterpret p-values. But that’s mostly a problem with people, not a problem with p-values :-)

The

Whether or not probability is taught in high school depends on the country (and in the U.S., on the state or even the individual school district).

Your examples are nice ones, but need to be followed up by (and connected to) examples (such as those in most scientific research) which are not as straightforward as your examples.

From a frequency perspective, it’s simple to ask “how often would we get a result more unusual than this if the die is totally fair and we repeat the experiment very very many times?” But does this mean something close to “what is one minus the probability that *this* die is loaded?”

driving home that distinction… not so easy

A) I suspect someone is cheating and using a loaded die which shows a high number (4, 5, or 6) more often than it should (the students will probably agree that for a fair die the probability of this event is 1/2).

To check whether the die is loaded, I define the null “it is not loaded and the probability of getting a high number is 1/2” and I toss the die 8 times. I get high numbers on 7 occasions and only once a low number.

The p-value is 0.035 < 0.05 (the probability of getting 7 or 8 high numbers out of 8, assuming the die is fair).

What is the probability that the die is loaded, given our data? We don't know!

B) The quality control department in a dice factory wants to verify that there are no manufacturing issues. One of the tests is to check whether high numbers appear too often (the faces with low numbers are slightly heavier).

The null hypothesis is, as before, that the probability is 1/2. I toss the die 8 times and the result is 7 high numbers and 1 low number.

The p-value is 0.035 < 0.05 (the probability of getting 7 or 8 high numbers out of 8, assuming the die is fair).

What is the probability that the die we are checking is biased, given our data? We don't know!

C) Uri Geller claims telekinetic powers and we perform an experiment to see if it is true. We will throw a die (perfectly fair as far as we know) 8 times, and record how often he "forces" a high number (4, 5, or 6).

The null hypothesis is, as before, that the probability is 1/2. I toss the die 8 times and the result is 7 high numbers and 1 low number.

The p-value is 0.035 < 0.05 (the probability of getting 7 or 8 high numbers out of 8, assuming the die is fair).

What is the probability that Uri Geller has telekinetic powers, given our data? We don't know!
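The p-value quoted in all three scenarios can be reproduced with a few lines of Python. This is only a sketch of the one-sided binomial tail described above; the function name is made up for illustration.

```python
from math import comb

def binom_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

# 7 or 8 high numbers out of 8 tosses, assuming the die is fair:
# (C(8,7) + C(8,8)) / 2**8 = 9/256
p_value = binom_tail(7, 8)
print(round(p_value, 3))
```

Nothing in this calculation knows whether the die came from a suspected cheat, a factory line, or Uri Geller, which is why the number is identical in A, B and C.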

“the probability of seeing a result that is at least this far from the null (as measured by the same variable, with a sample of the same size as used in the study), provided the statistical model of the phenomenon is correct and the null hypothesis is true (bearing in mind that the latter two conditions are rarely if ever met).”

* I may butcher the definition myself, but something like: the probability of seeing a result that is this far from the null, if and only if our statistical model of the phenomenon were precisely correct and the null were precisely true, neither of which we ever believe.

That intuitive conception is *not* strictly correct, and many other commonly-adopted interpretations of frequentist-based confidence intervals are also technically incorrect (as extensively documented, for example, by Richard Morey). But I would argue that in general the misconceptions related to frequentist CIs are relatively benign compared to those related to p values. For someone interpreting data on a question about which they have relatively little prior information, I don’t think they can go too far wrong with thoughtfully interpreting a CI (assuming, of course, the measurements are reasonable, etc.). If you have examples where that’s not the case, I’d love to learn about them.

For those who want to dig a bit deeper into where CIs come from, simulations can be really helpful (like this one: http://rpsychologist.com/new-d3-js-visualization-interpreting-confidence-intervals )

Finally, credible intervals are great. If someone wants to start there, or learn both approaches, that’s awesome. My point is that for those who think they want to know about p values, understanding them in terms of frequentist CIs can a) give them a much better sense of what ‘statistically significant’ means, and b) give them the tools to convert p values to CIs to hopefully be more thoughtful and less dichotomous in their thinking about the data obtained.
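As one illustration of converting a p value to a CI, here is a sketch that assumes the estimate has an approximately normal sampling distribution and that the reported p value is two-sided against a null of zero. The function name and the example numbers are hypothetical.

```python
from statistics import NormalDist

def ci_from_p(estimate, p_value, level=0.95):
    """Approximate CI recovered from an estimate and its two-sided p value.

    Assumes a normal sampling distribution and a null value of 0.
    """
    if not 0 < p_value < 1:
        raise ValueError("p_value must be strictly between 0 and 1")
    if estimate == 0:
        raise ValueError("cannot recover a standard error from a zero estimate")
    # z statistic implied by the p value, then the implied standard error.
    z_obs = NormalDist().inv_cdf(1 - p_value / 2)
    se = abs(estimate) / z_obs
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return estimate - z_crit * se, estimate + z_crit * se

# An effect of 2.0 that is "just significant": the interval just touches 0.
low, high = ci_from_p(estimate=2.0, p_value=0.05)

# The same effect with p = 0.20: the interval comfortably includes 0.
low2, high2 = ci_from_p(estimate=2.0, p_value=0.20)
```

Seeing that p = 0.05 corresponds to an interval running from about zero to twice the estimate tends to make “statistically significant” feel a lot less decisive.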

I’m just contributing this comment to highlight the concern that it is hard to completely correct mistakes in printed materials these days, making even understandable errors substantially problematic.

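# Posterior draws for the treated arm (19 events out of 203)
pt = rbeta(1000,12+19,87+203-19)
# Posterior draws for the control arm (10 events out of 217)
pc = rbeta(1000,12+10,87+217-10)
# Share of draws in which the relative risk pt/pc exceeds 1
sum(pt/pc > 1)/1000

This gives a 93% probability that the relative risk is over 1.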

If you imagine that the control group here was representative of many control groups for myalgia which you could have looked up info on, you could use something like beta(1,21) as the prior for both groups (diffuse relative to the results we have here on the control)

# Same calculation with the more diffuse beta(1,21) prior for both arms
pt = rbeta(1000,1+19,21+203-19)
pc = rbeta(1000,1+10,21+217-10)
sum(pt/pc > 1)/1000

This gives a 96% chance of the relative risk being over 1.

However, perhaps a more important question is how much total quality of life benefit or cost the treatment has. Let’s presume the benefit of a year without myalgia is 1, the benefit of a year with myalgia is say .5, a heart attack from which you survive has cost say -20, and reduces your benefit afterwards to .8 and .4 for without and with myalgia, and a heart attack that kills you before age 65 is -20 and there is no benefit after death… (here you can think of these benefits as ratios of two quantities with the same units, like dollars/dollars). Then we need some model for how long you live under treatment vs control (the assumption is statins reduce your heart attack risks etc… this is just a sketch)

Finally, we could come up with something that we actually care about: the benefit/cost to the patient, of giving the treatment, by simulating a few thousand lifespans, one year at a time… I won’t pretend to be able to do this in a blog post. But I think this is the right way to evaluate this treatment, and is how medical research should go.
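To give a flavour of what such a simulation could look like, here is a deliberately crude sketch. The yearly benefits (1, .5, .8, .4) and the heart attack cost (-20) come from the comment above; everything else is a made-up placeholder, including the myalgia rates, the annual heart attack risks, the fatality fraction and the 30-year horizon, and it ignores the uncertainty in all of them. It is not a real decision analysis.

```python
import random

def simulate_life(rng, years, p_myalgia, p_attack, p_fatal):
    """Total benefit for one simulated patient, stepped one year at a time."""
    has_myalgia = rng.random() < p_myalgia
    had_attack = False
    total = 0.0
    for _ in range(years):
        if rng.random() < p_attack:
            total -= 20.0              # cost of a heart attack, fatal or not
            if rng.random() < p_fatal:
                return total           # no benefit after death
            had_attack = True
        if had_attack:
            total += 0.4 if has_myalgia else 0.8
        else:
            total += 0.5 if has_myalgia else 1.0
    return total

def mean_benefit(n, seed, **params):
    """Average total benefit over n simulated patients."""
    rng = random.Random(seed)
    return sum(simulate_life(rng, **params) for _ in range(n)) / n

# All inputs below are hypothetical placeholders, not estimates from any study.
treated = mean_benefit(5000, seed=1, years=30,
                       p_myalgia=0.10, p_attack=0.010, p_fatal=0.3)
control = mean_benefit(5000, seed=1, years=30,
                       p_myalgia=0.05, p_attack=0.015, p_fatal=0.3)
net_benefit = treated - control
print(f"treated {treated:.2f}  control {control:.2f}  net {net_benefit:.2f}")
```

A serious version would draw the myalgia and heart attack rates from posterior distributions like the ones discussed in this thread, rather than fixing them, so that the uncertainty carries through to the net benefit.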

As for whether there “Are … many resources suitable for the non-statistician to help him?”

I would say that I hope eventually there will be clinical statisticians who can actually think up and do this kind of Bayesian decision theory calculation, instead of calculating a lot of p values. A medical researcher should have access to them, but probably as of today doesn’t.

If someone did an experiment with N=200 in both the group treated with statins and the control group and 10 people in each group reported myalgia, it would not change my beliefs about statins and myalgia very much.

Surely your beliefs about statins and myalgia are limited to referring to some “frame” or “population”, though.

No one is able to predict why some people get myalgia and others don’t, right? So maybe such a study just found a good subgroup, or something changed (some widespread dietary component, e.g. gluten, became unpopular). I don’t think that is any reason to ignore studies like that, especially when we are largely ignorant about why anything is going on in the system.

That 2012 note advises a practitioner dealing with a patient with uncomfortable muscle aches in the arms to “Explain that myalgia with statins is common, affecting 5-10% of patients in clinical trials of statins.”

A flat prior regarding the extent to which treatment with statins induces myalgia is inappropriate, given all the evidence and experience that are available today. If someone did an experiment with N=200 in both the group treated with statins and the control group and 10 people in each group reported myalgia, it would not change my beliefs about statins and myalgia very much.

It seems to me that the right prior regarding the effects of statins has a big lump of probability near 10% with relatively little probability below 5% or above 20%.

Bob

The clinician here has an intuition that the probability of the outcome of interest is greater in the treatment group than in the control group, despite the p-value from the NHST being above a given threshold. But he doesn’t know how to express it.

Are there many resources suitable for the non-statistician to help him?

Maybe then the best approach would be to openly state that this is a subject that requires far more time than is available, but to give you a sense of how fraught with danger it is, here are some quotes from Gelman, Cohen, Meehl, and the ASA…

And then end with the classic “if you think you understand quantum mechanics, you don’t understand quantum mechanics”, with “quantum mechanics” crossed out and “p-values” written in.

That would be a pretty nihilistic take on “interpreting statistical evidence”, but perhaps appropriate given the time constraint.

I would NOT give an uncritical “here’s what p-values are in 15 minutes” talk. Everyone would come away either feeling confused, or even worse feeling like they understood it.

1) The extraordinarily numerate (very few people)

2) Those completely uninterested in data (the storytelling folk), but even they fall victim to the stories being told by those who succumbed to temptation.

Gerd Gigerenzer describes a SIC Syndrome amongst doctors (Self defence, Innumeracy, Conflict of interests). The sad thing is that so often in medicine there is no data to illuminate a particular question and then when there is data, it simply becomes raw material for the SIC.