Will Moir writes:

This short New York Times article on a study published in BMJ might be of interest to you and your blog community, both in terms of how the media reports science and also the use of bayesian vs frequentist statistics in the study itself.

Here is the short summary from the news ticker thing on the NYTimes homepage:

Wow, that sounds really bad! Here is the full article:

https://www.nytimes.com/2017/05/09/well/live/pain-relievers-tied-to-immediate-heart-risks.htmlIt is extremely short, and basically just summarizes the abstract, adds that the absolute increase in risk is actually very small, and recommends talking to your doctor before taking NSAIDs. I guess my problem is that they have the scary headline (53%!), but then say the risk is actually small and you might or might not want to avoid NSAIDs. So is this important or not? The average reader probably has not thought much about relative versus absolute risk, so I wish they would have expanded on that.

In terms of bayesian vs frequentist, this study is bayesian (bayesian meta-analysis of individual patient data). Here is the link:

http://www.bmj.com/content/357/bmj.j1909

Despite being bayesian, the way the results are presented give me very frequentist/NHST vibes. For example, the NYTimes article gives the percent increase in risk of heart attack for the various NSAIDs, which are taken directly from the odds ratios in the abstract:

With use for one to seven days the probability of increased myocardial infarction risk (posterior probability of odds ratio >1.0) was 92% for celecoxib, 97% for ibuprofen, and 99% for diclofenac, naproxen, and rofecoxib. The corresponding odds ratios (95% credible intervals) were 1.24 (0.91 to 1.82) for celecoxib, 1.48 (1.00 to 2.26) for ibuprofen, 1.50 (1.06 to 2.04) for diclofenac, 1.53 (1.07 to 2.33) for naproxen, and 1.58 (1.07 to 2.17) for rofecoxib.

This reads to me like the bayesian equivalent of “statistically significant, p<0.05, lower 95% CI is greater than 1”! To be fair that is just the abstract, and the article itself provides much, much more information.

The following passage also caught my eye:

The bayesian approach is useful for decision making. Take, for example, the summary odds ratio of acute myocardial infarction of 2.65 (1.46 to 4.67) with rofecoxib >25 mg/day for 8-30 days versus non-use. With a frequentist confidence interval, which represents uncertainty through repetition of the experience, all odds ratios from 1.46 to 4.67 might seem equally likely. In contrast, the bayesian approach, although resulting in a numerically similar 95% credible interval, also allows us to calculate that there is an 83% probability that this odds ratio of acute myocardial infarction is greater than 2.00.

It seems like they’re using bayesian methods to generate alternative versions of the typical frequentist statistics that can actually be interpreted the way most people incorrectly interpret frequentist/NHST stats (p=0.01 meaning 99% probability that there is an effect, etc). If so that is great because it makes sense to use statistics that match how people will interpret them anyway, but I also imagine it also would be subject to the same limitations and abuse that is common to NHST (I am not saying that about this particular study, just in general).

I agree. If you’re doing decision analysis, you can’t do much with statements such as, “there is an 83% probability that this odds ratio of acute myocardial infarction is greater than 2.00.” It’s better to just work with the risk parameter directly. A parameter being greater than 2.00 isn’t what kills you.

I didn’t read the article, but I assume that from a clinical point of view, the relevant outcome is myocardial infarction. The clinician, faced with a new patient and a set of medication options, including nothing at all, has to predict the chances of myocardial infarction under each treatment option.

Why not get the posterior predictive distribution of myocardial infarction under each option, and report that? Isn’t that what clinicians (and drug manufacturers) really care about?

(My guess is that the predictive distributions will be effectively indistinguishable among options. Just a guess.)

When you are presented with a new patient, the posterior predictive distributions of MI under each treatment may be too wide making it difficult to compare them. However, what you really would like to get is a prediction for the effect of the treatment (compared to the baseline). The relative risk is assumed to be independent from the baseline level, but every model has its limitations.

The distribution of the risk of having a traffic accident might span several orders of magnitude depending on the person, but if driving after eating cucumbers a person was twice as likely to have an accident you would probably find that interesting.

And do you really want to report the full functional form of the posterior distribution? What do you expect the clinician to do with this curve? Reporting a summary doesn’t seem a bad idea. And given the summaries reported, what do you mean by “effectively indistinguishable”?

I think reporting a summary is appropriate in this kind of case, but I’d like to see something real world as Andrew says, like “expected QALY loss as a function of current age” plotted as a curve, rather than “range of credible odds ratios”.

If the expected QALY loss is less than 1 day for all people over 18 for example, we could all rest assured… but to the academic industry this would feel like “we got a boring result” and not produce grants and publications! Of course, to people in pain, this would be a really really interesting result… because they could keep taking their pain meds!

Also, did they not examine Aspirin which is an NSAID and is given at low doses *to reduce heart attack risks*?

I think Andrew would prefer a credible interval of QALY loss, rather than a point estimate.

Sure, plot the expected QALY loss as a function of age as a dark line with a 50% credible interval around it as grey, and a 95% credible interval as very light grey. That’d be a great summary.

Actually, let’s plot it as QALY gain, so that negative numbers are losses…. Because really there’s probably a QALY gain given that people prefer not to be in pain, and people who need weeks of NSAIDS are in pain.

I like the general direction (focussing on summary health outcome). But from a decision-theoretic perspective… why should we care about the interval?

If the interval is telling you about the range of outcomes the population can expect (as opposed to say the estimation error) then it helps you to see how much of a gamble things are. Suppose you’re talking about average QALY gain of 1 day, with an expected population variation of -300 days to +125 days, that’s kind of very different than -2 to +4 with an average of 1. You might ask yourself something like “is there something special about me that puts me in the category closer to the -300 people ? or closer to the 125 ?

They seem to have the data since, aspirin use is now treated as a confounder:

I’d guess that aspirin is already known to be cardioprotective so there is no reason to report on those results.

They were forced away from non-informative priors for the between study variance but not other parameters.

Doubt if it offers anything over a classical likelihood analysis.

This [claiming an uniformed Bayesian analysis offers much more] has been going on for a long time as I wrote in 2003/4 – “It [Bayesian approach] offers a direct means to both specify what are reasonable values of the parameters (in terms of prior probability distributions) and combine this ‘probability distribution based’ information, as we saw above, with the sample based likelihood information using Bayes theorem. Unfortunately, we do not see many serious attempts to carry this out in meta-analysis. ” Bayesian random effects meta-analysis of trials … http://onlinelibrary.wiley.com/doi/10.1002/sim.2115/epdf

Things seem to change very, very slowly in applied statistics…

Meanwhile, in the Washington Post this morning:

“The studies we do have suggest that health insurance does save some lives; the Annals of Internal Medicine published a meta-analysis this month concluding that the odds of dying among the insured relative to the uninsured is 0.71 to 0.97.”

Last time I checked, the probability of dying among both insured and uninsured was identically equal to 1 (excluding immortals from our sample who have a probability of dying identically equal to 0). So the odds in both cases are undefined and the odds ratio is positive infinity over positive infinity. Somehow that ratio has a confidence interval of 0.71 to 0.97.

Dalton:

I’ve not seen the study in question but my guess is that it’s a hazard model that estimates the effect of health insurance on the instantaneous probability of death. It’s possible for a treatment to lower the instantaneous probability of death by 20%, say, so that you’d say the instantaneous probability of dying among the insured relative to the uninsured is 0.80. Using standard life tables, you can compute the expected additional years of life if your probability of dying is reduced by 20% at all ages.

Yet another example of how people ignore dimensional analysis. The appropriate wording in that case would be “relative probability of dying per unit time”

Because when you try to convey the information to low-numeracy people who will try to convey the information to very low-numeracy people, the “per unit time”, or anything with a hint of nuance, they get grumpy and not listen to you. I’m not talking about a university with a PR department, but instead a situation where the medium/low-numeracy people run the show, and you need to distill things to what they understand. It’s actually very frustrating.

With all that said, frankly Bayesian results (e.g., posterior intervals) more specifically answer everyone’s questions. Problem is people with some but not enough statistical understanding declare “minimum sample sizes” and “p>=0.05 implies zero effect” are those in charge of what is said quantitatively.

The world is not ready for your Existentialist Epidemiology.

I also think that these studies, whilst useful, would add a lot more value by considering the clinical model underlying the comparisons. Numbers without full context make interesting articles, but without real-world meaning often. Statistically, consider some of the more relevant predictors, at multiple levels, in addition to just the single medication. In this instance, people take these medications for pain. Pain generally indicates a clinical condition. Clinical conditions that require pain relief, typically do not only use one approach. For example, chronic pain, CRPS, neuropathies, etc, typically have a number of opiate agonists, as primary relief with NSAIDs as adjunctive. The point is that the reason for taking pain relief is highly suggestive of a clinical condition that impairs functioning and that interactions are inherent. These conditions range from cancers, post-operative pain, injury-related immobility, orthopaedic pain, diabetic neuropathy, cardiac causes etc. Given that cancer and cardiac origins of pain account for significant levels of mortality, it is isn’t so surprising that pain relief for these conditions might be associated with higher mortality overall for people taking pain relief. However, that is of course not possible using the above data or methodology. So what would be a Bayesian way of introducing the uncertainty of factors sitting behind what is being examined? A hyperparameter? But how to interpret that…

Clinical condition -> Increased mortality from the condition -> pain relief -> NSAIDs as part of meds -> increased mortality from NSAIDS????

You want them to do science? (Sarcasm)

Seriously though, that’s a great point, and yes it can be modeled in a bayesian model, and you’ll probably find identifiability issues, and then you can look at your model and see what data might help you identify the different contributions. Just to give you an idea of how that might work (and note, this is Glen’s favorite single-subject analysis):

Just consider one illness for simplicity of exposition.

Consider the probability of death as a function of time P_i(t) for each person, consider a differential equation for this:

dP/dt = (1-P(t))^q * (BaselineRate(age(t)) + k*Severity_of_illness(t) + l*dosage_of_pain_drug(t))

dosage_of_pain_drug(t) = some_function(Severity_of_illness(t)) // this could be a differential equation too…

get BaselineRate as a given function from life tables averaged over the whole population or from a control group with no significant illnesses.

let k,l be individual coefficients per person with a common hyperdistribution.

Now, you need a group of people who either refuse to use the pain drug, or you take them off the pain drug for a while, or assign them randomly to switch between pain drugs through time (and modify your model)… So that you can potentially identify at least the relative risks from different pain drugs. Whether placebo makes any sense here is a question, I mean can you give these people starch and let them suffer pain? It’s an ethical issue)

observe a big population for long enough that you have some baseline expectation of deaths just from the background risk…

Analyze the data through time to get estimates of the individual, and population hyperparameters for death risk…

Done right with the right design, you can make this kind of model extremely informative, and identified.

Thanks — one of the big picture issues in these type of studies is exactly that they don’t consider life tables or that mortality is a DE. The same applies when one simply models a survival curve against a forcing factor on that curve. I did some work for a colleague a while ago modeling how to match an index of disease (Charlson Score was the example we used) against life tables and what the imposition of a disease did to mortality. A computational model, it seemed, is not that responsive to small things like whether one used an NSAID or not, if the models http://bit.ly/2sjvHLB we played with are any indication. It takes a fairly significant level of change in mortality to drag a survival curve in any direction (which makes sense). It just seems ‘right’ that these models must be Bayesian in nature.

One issue is that for people who actually die during the study the far boundary is known, so one thing you could do is solve the ODE by spectral method, with the boundary condition automatically matched (ie. 100% chance of death at the actual observed date of death for those who die). For those who don’t die during the study obviously the boundary is free, but hierarchically modeled so that you pool information across the multiple people you recruit. You’d want a wide range of initial ages, that would help you understand the role of age.

Hmm… Now I want to do a simulation study.

One thing we found in simulating mortality is that (consistent with clinical research) it is far better to use the 50+ or even 65+ groups. Mathematically, this is where the derivative is greatest on the sigmoid curve and so the (small) impact of variables is most likely to have an impact. Very old and young are just plainly less useful and clog up simulations with small effects! I guess in engineering terms, those age ranges are when parts should be replaced (and, medically, when indeed this occurs). Having said that, there also are anomalies, such as diabetes. In that instance, some of the research suggests that when young people engage in a lifestyle leading to Type 2, their lifespan shortens to a fixed number of years much as if onset was in the more typical later years a couple of generations ago. So, referring back to the above study on painkillers, where these people are then given analgesia, it hardly seems surprising that they would also die from an MI or something else within a fixed number of years. I do wonder whether the provision of analgesia for even a short time is not any more than just an indicator variable for a greater hazard ratio for that person from some other source. Even using the above paradigm in the BMJ study, additionally coding for Charlson (https://www.mdcalc.com/charlson-comorbidity-index-cci) predictors of increased mortality, modeling this as a parameter, and then examining relative contribution of NSAIDS would be a useful next step. The NSAID study (and its ilk) drives clinicians nuts as it really does not help inform practice in any real sense. I mean, would you then say to someone in pain “I can’t give you painkillers as they might increase your chance of an MI”. Who would ever prescribe them, let alone take them?! Well, IMHO anyway.

Yes, older groups makes sense. NSAIDs are unlikely to kill 20 yo. But you would want to look in range 40 to 80 instead of say recruiting only 60 to 65 for example. Part of the reason is that the simple model I wrote down is wrong. There needs to be interaction with the natural baseline as well. Whatever makes you more likely to die also potentially makes drugs more toxic etc.

As is this study seems totally clinically unhelpful and designed for tenure and grants.

this is also what i do in my papers. it’s a compromise. i would never be able to publish bayesian models otherwise.

As someone working a related area, it’s good to see these methods being used. It’s easy to criticise and say they should have done X, Y or Z (which may indeed have made a more clinically useful/scientifically better study) (but also in some cases a bigger and more complicated study) but as the original correspondent says “it makes sense to use statistics that match how people will interpret them anyway.” That’s an important step forward.

Next thing I want to see is for people not to feel they need to put “bayesian” in the title! [and “bayesian” with lower case b? Is that accepted usage? seems like a nice thing, I can’t explain why]

Simon:

I really don’t get this – or maybe I do.

If you are going to accept non-informative priors as relevant and reasonable (and so can accept posterior probabilities as sensible) then you can just take the confidence intervals as being credible intervals and you can skip all the extra work doing Bayes.

> people not to feel they need to put “bayesian” in the title!

But they have done all that extra work and the need to get extra credit for it!

My sense of what’e happening here, which started almost 20 years ago, was that folks that did meta-analysis of clinical trials in Revman (simple inverse variance closed formula methods as per Cochrane Collaboration) discovered it was not that hard to run WinBugs with default non-informative priors.

Perhaps at first they were surprised to get essentially the same intervals but without much extra work they could now claim to have done much more and _get away with it_. One of the first, I think, was published in NEJM (late 1990’s?) with an editorial calling attention to it being _BAYESIAN_.

> It’s easy to criticise and say they should have done X, Y or Z

Unfortunately to realistically deal with uncertainties and confounding meaningfully better with Bayesian methods takes hard work and even harder thought. It should be worth it but most statisticians are just taking the easy route – quickly pulling the Bayesian crank (with default priors) and claiming to have accomplished something important.

Shravan’s comment above is sad but true – more sensible Bayesian analyses likely will be hard to publish. Certainly not worth it if you want to progress in your career ASAP.

p.s. I was banned from the Cochrane Collaboration’s Statistical Methods Group’s email list for making overly provocative comments like these.

Keith – thanks for reply. I don’t really disagree with anything you say – I guess I feel that we’re starting from such a low baseline that getting anything more sensible done feels like a win. I think that Bayesian methods do get you something extra even if you do use default non-informative priors (not that I’m defending that practice) – for example, statistics that mean something that people want to know (and can understand), and the ability to answer other relevant questions, like probability that the differences are clinically important and so on. Maybe not perfect but a step forwards maybe? As you said, change happens slowly – sounds like you can at least feel good about being so far ahead of the game!!

> you can at least feel good about being so far ahead of the game!!

Actually its not a good position to be in – except perhaps just a bit ahead.

> I think that Bayesian methods do get you something extra even if you do use default non-informative priors

In most research settings, we will just have to agree to disagree – the simulations under default non-informative priors here, I believe, make it clear why http://andrewgelman.com/2016/08/22/bayesian-inference-completely-solves-the-multiple-comparisons-problem/ – its the habits of interpretations of results that the methods enable and encourage in clinical researchers that really matters.

Also, for questions the study was trying to address – I think there are more promising things for statisticians to work on as being pointed to here https://ww2.amstat.org/meetings/jsm/2017/onlineprogram/AbstractDetails.cfm?abstractid=325035

Keith:

” Actually its not a good position to be in – except perhaps just a bit ahead.”

I really think you’re right. Cassandra syndrome I call it. You know exactly what is wrong with some stuff, but society just treats you as crazy or boots you off the mailing list, or marches into Iraq looking for WMD or whatever. It’s actually common in Engineering. Think of all those guys who knew damn well decades before it did happen, exactly what would happen when a hurricane like Katrina hit New Orleans? etc etc.

“What is the greatest wonder of the world? That no-one, although he sees others dying all around him, believes that he (sorry!) himself will die.” Mahabharratah. Being human means engaging in self-deception; the challenge of statistics is the ‘challenge’ bit, not the statistics themselves.

Keith:

This

“If you are going to accept non-informative priors as relevant and reasonable … then you can just take the confidence intervals as being credible intervals”

came up in a thread just a few days ago. Anoneuoid asked for a reference to a good formal statement of this. Can you recommend one? (I went looking for a good one and ended up at Larry Wasserman’s blog, linked to in the thread, but while it had a nice informal discussion there wasn’t a formal statement.)

Mark: I was being informal – general experience has been that the intervals for pooled effects in these sort of meta-analysis are very similar.

If one wanted to be sure, probably easiest to do the Bayesian analysis and check.

As for formal, there has been a lot done on ensuring what priors will produce credible intervals for what parameters that are bona fide confidence intervals e.g. DAS Fraser https://projecteuclid.org/euclid.ss/1484816587 . But that is pretty much involves one to one functions i.e. a joint model, data and a given method for constructing credible intervals that always results in the same interval (disregarding MCMC sampling error) so not unmanageable.

To go the other way, which of the many bona fide confidence intervals given the same data that could be constructed will be matched by a joint model (prior and data model) that has a some version of a non-informative prior, same data and a given method for constructing credible intervals – you have an m to n function (or m to 1 if the non-informative joint model and given method for constructing credible intervals is fixed.)

If Larry chose not to be formal, I certainly am not ;-)

*if* a confidence interval is constructed from a likelihood, then because a confidence procedure must by definition meet the coverage constraint independent of what the parameter value actually was, it needs to be symmetric in the prior for every value. We had this discussion in the comments a while back, Carlos Ungil, who keeps me honest, pointed out that I was misinterpreting what a confidence procedure has to do.

You will find that the actual precise definition of a confidence procedure is rarely discussed in typical stats textbooks. I have two or three bog standard stats books (non Bayesian) and couldn’t find it stated precisely, usually just an informal description and lots of examples of calculating standard procedures. Jerzy Neyman apparently was the one who came up with the definition as part of the founding of confidence theory in around 1941. Unfortunately I can’t quote you the definition, I don’t have his book.

In particular consider the following two options, we have data that really does come out of a high quality computer RNG (so no distribution assumption violations), the value put in to the RNG for the mean parameter was X a number chosen by someone else and not told to you, but a definite constant number.

A “reliable” procedure (my terminology) could be one where you collect data D, and construct an interval I using p(Data | X) p(X) with real Bayesian prior on X, and it’s “95% reliable” (again my made up term) if 95% of the time that you get a new dataset, the interval you construct using this procedure contains the true value of X.

That apparently is NOT a confidence procedure. Why? Because Jerzy Neyman didn’t want confidence procedures to depend on your choice of a prior or some such thing. He said, it’s a confidence procedure if the procedure works for EVERY POSSIBLE X, not just whatever the X you actually have in your particular problem.

since it has to work for every possible X, this is a symmetry property that implies p(X) is proportional to a constant (ie. it’s a nonstandard uniform prior on [-N,N] for N a nonstandard number, aka an improper prior, flat prior, etc)

Andrew pointed out in comments a few days ago, that essentially sometimes you can constrain the conceivable range to a finite interval, like for a probability, it only makes sense to be in the interval [0,1], in which case the confidence procedure has prior that is just flat on that finite interval and is therefore proper.

But, not all confidence procedures are necessarily made from likelihoods!

So, when there’s a correspondence between CI and Bayes, the correspondence is to a constant prior precisely because the CI procedure is a CI procedure only when it works for all parameter values.

This leads to the perverse situation where when people use Bayesian procedures that have even mildly correct priors (puts the actual value not too deep into the tail) they get better coverage because their prior up-weights the region of parameter space near where the actual value is.

You could imagine the following two meta-procedures:

1) Think up a scientific question, go out collect data and form a confidence interval using standard frequentist CI procedure

2) Think up a scientific question, ask someone with a mild amount of experience in this field for the order of magnitude of the parameter, make a prior that covers those values, collect data and make a Bayesian HPD interval.

You will find that (2) is more often right than (1) as an empirical fact. Sadly, (2) may be WAY WAY more often right than (1) in some areas of science as an empirical fact, particularly when the real world data generating process is messy and has lots of nuisance parameters, like measurement error of instruments and varying biases from place to place, and soforth:

Here’s some comments from Andrew about less than nominal coverage in physics that I got by googling: http://andrewgelman.com/2014/03/15/problematic-interpretations-confidence-intervals/#comment-155456

Here’s a link to the Henrion and Fischoff paper on coverage in physics: https://www.cmu.edu/dietrich/sds/docs/fischhoff/Henrion-Fischhoff-AmJPhysics-86.pdf

What you’ll see is that the confidence intervals have atrocious coverage. Between 1870 and 1940 it looks like there were 13 intervals collected, and 6 of them had coverage 6/13 ~ 50%

Here’s just one example study of bootstrap based intervals from real world income data:

https://www.r-bloggers.com/actual-coverage-of-confidence-intervals-for-standard-deviation/

To get 95% coverage in the bootstrap interval requires a sample size of upwards of 20,000 data points for this real world data. So I guess by definition, the bootstrap isn’t a confidence procedure? Unless you have 20k observations in which case it is? Bootstrap is widely taken to be a distribution-free method of constructing confidence intervals, but I guess it only has claimed coverage asymptotically. So in real world data with moderate sample sizes, again the Bayesian procedure of putting in some prior information seems to more often gets things right.

Finally we have the following that I wrote a couple days ago: http://models.street-artists.org/2017/06/27/on-morality-of-real-world-decisions-and-frequentist-principles/

When there are real world consequences to your actions post-inference, such as clinical studies, then it makes a difference in people’s lives whether you do the irrelevant thing of trying to get the parameter correct, or the relevant thing of trying to get the *consequences* correct.

When you want to get the consequences correct (which is essentially all the time in clinical and engineering and policy decision making etc) it seems very very likely, short of some bizarre technical corner cases that I don’t think really are of interest, like where your loss function is discontinuous everywhere (the kind of think Mathematicians love to think about), that the Bayesian method for evaluating the consequences is the thing that has best Frequentist properties (ie. it does the right thing more often than Neyman’s confidence intervals etc). The idealized version of this statement is Wald’s theorem. But empirically the Henrion-Fischhoff paper gives you the idea

Formally then, you need to find Neyman’s book/papers where he defines a confidence interval as a procedure that works for all values of X, and this then implies the symmetry property that requires the prior to be flat.

Thanks Keith, thanks Daniel. I also went back to the Larry Wasserman blog entry (link reproduced here: https://normaldeviate.wordpress.com/2013/07/13/lost-causes-in-statistics-ii-noninformative-priors/) and started reading the paper on noninformative priors cited there, (Kass-Wasserman JASA 1996, http://www.stat.cmu.edu/~fienberg/Statistics36-756/KassWasserman-JASA-1996.pdf). Quite a rabbit hole … and very interesting … but lots of different results cited with variations on various themes, unlike the Bernstein-von Mises Theorem (which seems to be a singular result with an intuition I can get my head around … though I could be kidding myself here).

Anyway, thanks again.

“…also allows us to calculate that there is an 83% probability that this odds ratio of acute myocardial infarction is greater than 2.00.”

Many Bayesians seem to be very keen on stating that this kind of thing is a big advantage and exactly what the practitioner wants to know, but for me this is pretty meaningless as long as there is no attempt to convince me that any other prior that can be argued equally well as reasonable can’t give very different results such as 56% or 24% or whatever.

(OK, intelligent people here have some good ideas which other kinds of posterior probability practitioners would rather want to know but the issue applies to those, too.)