So. The other day the following email comes in, subject line “Grabbing headlines using poor statistical methods,” from Clifford Anderson-Bergman:
Here’s another to file under “How to get mainstream publication by butchering your statistics”.
The paper: Comparison of Hospital Mortality and Readmission Rates for Medicare Patients Treated by Male vs Female Physicians
Featured in: NPR, Fox News, Washington Post, Business Insider (I’m sure more, these are just the first few that show up in my Google News feed)
Adjusted mortality: 11.07% vs 11.49%; adjusted risk difference, –0.43%; 95% CI, –0.57% to –0.28%; P < .001; number needed to treat to prevent 1 death, 233.

Adjusted readmissions: 15.02% vs 15.57%; adjusted risk difference, –0.55%; 95% CI, –0.71% to –0.39%; P < .001; number needed to treat to prevent 1 readmission, 182.

Statistical folly: "We used a multivariable linear probability model (ie, fitting ordinary least-squares to binary outcomes) as our primary model for computational efficiency and because there were problems with complete or quasi-complete separation in logistic regression models."

Regarding the number of regression parameters: not explicitly listed, but from the following paragraph I would suspect there are at least hundreds of regression parameters (such as an indicator for the medical school attended):

"We accounted for patient characteristics, physician characteristics, and hospital fixed effects. Patient characteristics included patient age in 5-year increments (the oldest group was categorized as ≥95 years), sex, race/ethnicity (non-Hispanic white, non-Hispanic black, Hispanic, and other), primary diagnosis (Medicare Severity Diagnosis Related Group), 27 coexisting conditions (determined using the Elixhauser comorbidity index), median annual household income estimated from residential zip codes (in deciles), an indicator variable for Medicaid coverage, and indicator variables for year. Physician characteristics included physician age in 5-year increments (the oldest group was categorized as ≥70 years), indicator variables for the medical schools from which the physicians graduated, and type of medical training (ie, allopathic vs osteopathic training)."
Also: setting aside the question of whether all of these effects are strictly additive on the probability scale (answer: no), it's not even clear that we want to condition on many of these physician characteristics, medical training, etc., if we want to talk about the causal effect of being treated by a male rather than a female doctor. I don't have any idea about whether treatment from male or female doctors is better. But I do know that this paper gets us exactly 0 steps closer to answering that question.
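For what it’s worth, both of Clifford’s numerical points are easy to check. The number needed to treat is just the reciprocal of the adjusted risk difference, and one well-known pathology of ordinary least squares on a binary outcome is that it happily produces fitted “probabilities” outside [0, 1]. A minimal sketch in Python, using simulated data rather than the Medicare records:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- NNT back-check: number needed to treat = 1 / |risk difference| ---
nnt_death = 1 / 0.0043       # adjusted risk difference for mortality, -0.43%
nnt_readmit = 1 / 0.0055     # adjusted risk difference for readmission, -0.55%
print(round(nnt_death))      # 233, matching the paper
print(round(nnt_readmit))    # 182, matching the paper

# --- Linear probability model on simulated binary data ---
n = 1000
x = rng.normal(0, 2, n)                  # a strong continuous predictor
p = 1 / (1 + np.exp(-2 * x))             # true probabilities are logistic in x
y = rng.binomial(1, p)                   # binary outcome

# OLS fit of y on x: the "linear probability model"
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta

# Fitted values stray outside [0, 1] -- probabilities they are not
print(fitted.min() < 0, fitted.max() > 1)
```

None of this says the paper’s point estimates are wrong, just that the reported NNTs are pure arithmetic on the risk differences, and that the “linear probability model” label papers over a model that isn’t actually a probability model.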
And this from Scott Wong an hour later:
I’m an avid reader of your blog and it has changed the way I look at statistical analyses. I’ve started using Stan (python) multi-level models in my own work because of their ability to control/balance many factors at once.
A recent article caught my eye that uses null hypothesis significance testing to make drastic claims: “Evidence of the Superiority of Female Doctors: New research estimates that if all physicians were female, 32,000 fewer Americans would die every year.” The key result in the research study is that analysis of ~1.6MM hospitalizations revealed 30-day mortality of patients treated by female physicians was 11.07% vs 11.49% for male physicians. Similar patterns were found within smaller cohorts (reducing likelihood of Simpson’s paradox rearing its head).
The magnitude and confidence of their results are quite surprising. I didn’t identify any glaring errors in the research study, so I’m wondering if this is a garden-of-forking-paths result or if the researchers are really on to something?
Other readers of your blog might be interested as well, so I was hoping you could discuss the paper there.
And then, 29 minutes later, this from Dean Eckles:
I thought you might find this example useful. It has some of your “favorite” things, and in an important setting.
This paper analyzes a lot of Medicare records with physician gender as the treatment, based on the idea from prior work that female physicians are “more likely to adhere to clinical guidelines and evidence-based practice”. The main analysis uses linear regression adjusting for a number of patient and physician characteristics and hospital fixed effects.
The cautious description of the result in the paper is “These findings suggest that the differences in practice patterns between male and female physicians, as suggested in previous studies, may have important clinical implications for patient outcomes.” The editorial comment in JAMA Internal Medicine titled “Equal Rights for Better Work?” more boldly says “These findings that female internists provide higher quality care for hospitalized patients yet are promoted, supported, and paid less than male peers in the academic setting should push us to create systems that promote equity in start-up packages, career advancement, and remuneration for all physicians.” And the press release from Harvard says:
“The difference in mortality rates surprised us,” said lead author Yusuke Tsugawa, research associate in the Department of Health Policy and Management. “The gender of the physician appears to be particularly significant for the sickest patients. These findings indicate that potential differences in practice patterns between male and female physicians may have important clinical implications.”
By the time we get to the press coverage, we have:
– “Don’t want to die before your time? Get a female doctor” — USA TODAY
– “Evidence of the Superiority of Female Doctors: New research estimates that if all physicians were female, 32,000 fewer Americans would die every year” — The Atlantic
– “I’m assuming the difference is because of the way that women, in general, communicate. It’s about being better listeners, more nurturing and having emotional intelligence.” — NPR All Things Considered
As I commented on Twitter, this kind of analysis isn’t usually going to be very credible, but there are a few things that stick out. The authors adjust for patient and physician age only in 5-year buckets, so any within-bucket selection of doctor gender remains a confounder. This is especially noteworthy because (a) being Medicare data, we are only talking about patients 65 and over, so each bucket is a pretty large fraction of the data, and (b) there are substantial differences in the ages of male and female physicians. There is an alternative analysis in the supplement that adjusts for age as a continuous variable, though of course as implemented here that just replaces coarsening with a restriction to linearity.
Since this has:
– Observational causal inference
– Authors reporting being surprised by their results
– Adjustment for coarsened age, rather than non-parametric adjustment for age as observed
– Breathless coverage by the press
I thought you and your blog readers might find it interesting.
The authors do consider a subset of physicians who are hospitalists and call this a quasi-experiment, arguing that assignment to physician is then mainly determined by quasi-random shifts. But this also lacks some of the usual trappings of a credible analysis in that there isn’t much done to provide evidence for this assumption. And the effect size for this subpopulation also gets smaller with more controls.
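Dean’s coarsening point can be illustrated with a toy simulation: if physician age drives both physician gender and patient outcomes, then adjusting for age in 5-year buckets leaves a residual “gender effect” even when the true effect is exactly zero. All numbers below are made up, and to keep the illustration noise-free the outcome is the underlying mortality risk rather than a 0/1 death:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# All numbers invented; only the mechanism matters.
doc_age = rng.uniform(30, 70, n)              # treating physician's age
p_female = 0.8 - 0.015 * (doc_age - 30)       # female physicians skew younger
female_doc = rng.binomial(1, p_female)
risk = 0.08 + 0.001 * (doc_age - 30)          # outcome worsens with physician age;
                                              # TRUE effect of gender is exactly zero

# Adjust for physician age coarsened into 5-year buckets
bucket = ((doc_age - 30) // 5).astype(int)    # 8 buckets: 30-35, ..., 65-70
dummies = (bucket[:, None] == np.arange(8)).astype(float)
X = np.column_stack([female_doc, dummies])    # bucket dummies span the intercept
beta, *_ = np.linalg.lstsq(X, risk, rcond=None)
print(beta[0])   # negative: residual within-bucket confounding, despite zero true effect

# Adjusting for age as observed removes the bias entirely
X2 = np.column_stack([np.ones(n), female_doc, doc_age])
beta2, *_ = np.linalg.lstsq(X2, risk, rcond=None)
print(beta2[1])  # essentially zero
```

The residual bias here is small in absolute terms, but then so is the effect the paper reports; the point is only that 5-year coarsening does not neutralize a confounder that varies within buckets.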
And then the next day from Brandon Butcher:
I came across the following headline from Scientific American: https://www.scientificamerican.com/article/female-doctors-may-be-better-for-older-patients-health/
This seemed like something you might find interesting. Here’s the link to the original JAMA article: http://jamanetwork.com/journals/jamainternalmedicine/fullarticle/2593255
A few quotes from the methods section I [Butcher] found concerning:
· Model 2 adjusted for all variables in model 1 plus hospital fixed effects (ie, hospital indicators), effectively comparing male and female physicians within the same hospital.
· We used a multivariable linear probability model (ie, fitting ordinary least-squares to binary outcomes) as our primary model for computational efficiency and because there were problems with complete or quasi-complete separation in logistic regression models.
Kinda funny that my first correspondent wrote about “butchering” the statistics, and my last correspondent’s name is Butcher. What’s that all about, huh?
In all seriousness, I don’t have the time or energy to look at this one at all. But it does seem like an important topic so I thought I’d share it with you. You can read the paper yourself and make of it what you will.
P.S. A reader pointed me to a recent response to critics from Ashish Jha, one of the authors of the paper discussed above.
P.P.S. My correspondent writes:
Might the JAMA study fall into the category of The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time?
“Statistical significance is a lot less meaningful than traditionally assumed, for two reasons. First, abundant researcher degrees of freedom (Simmons, Nelson, and Simonsohn, 2011) and forking paths (Gelman and Loken, 2014) assure researchers a high probability of finding impressive p-values, even if all effects were zero and data were pure noise. Second, as discussed by Gelman and Carlin (2014), statistically significant comparisons systematically overestimate effect sizes (type M errors) and can have the wrong sign (type S errors).”
Ref: The statistical crisis in science: How is it relevant to clinical neuropsychology?
My reply: Yes, this is possible. In some sense, perhaps “forking paths” should be our default assumption when judging such work. Rather than seeking evidence to support a claim of forking paths, perhaps we consider all p-values to be the result of forking paths (except in rare cases of preregistration) and we look for our inferences elsewhere. But I’m not quite ready to take that step in the context of a highly-publicized paper that I’ve been too busy to even read!
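On the type M and type S point in that quote, a quick simulation (made-up numbers) shows how statistically significant estimates of a small effect systematically exaggerate it, and sometimes get its sign wrong:

```python
import numpy as np

rng = np.random.default_rng(2)

true_effect = 0.1     # small true effect (arbitrary units)
se = 0.5              # standard error of each study's estimate
n_sims = 100_000

# Each simulated study observes the true effect plus Gaussian noise
est = rng.normal(true_effect, se, n_sims)

# Keep only the "statistically significant" estimates (|estimate| > 1.96 se)
sig = est[np.abs(est) > 1.96 * se]

exaggeration = np.mean(np.abs(sig)) / true_effect   # type M (magnitude) error
wrong_sign = np.mean(sig < 0)                       # type S (sign) error
print(exaggeration)   # significant estimates overshoot the true effect several-fold
print(wrong_sign)     # an appreciable fraction even have the wrong sign
```

This is just the Gelman and Carlin (2014) argument in simulation form: when power is low, conditioning on statistical significance guarantees exaggeration, no matter how clean the analysis.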
P.P.P.S. Also this highly critical assessment from William Briggs:
The NBC News story “Female Doctors Outperform Male Doctors, According to Study” makes these bold claims.
Patients treated by women are less likely to die of what ails them and less likely to have to come back to the hospital for more treatment, researchers reported Monday.
If all doctors performed as well as the female physicians in the study, it would save 32,000 lives every year, the team at the Harvard School of Public Health estimated.
Yet women doctors are paid less than men, on average, and less likely to be promoted.
“The data out there says that women physicians tend to be a little bit better at sticking to the evidence and doing the things that we know work better,” [Harvard’s Dr. Ashish Jha, who oversaw the study] told NBC News.
The ordinary reader would assume female doctors are always much better than male doctors, and the reason is (partly) because male doctors practice medicine regardless of what the evidence dictates. Worse, they receive greater rewards for their foolish and dangerous behavior.
The NBC story drew from paper “Comparison of Hospital Mortality and Readmission Rates for Medicare Patients Treated by Male vs Female Physicians” in the journal JAMA Internal Medicine by Tsugawa, Jena, and Figueroa. Its main claim is this:
Using a national sample of hospitalized Medicare beneficiaries, we found that patients who receive care from female general internists have lower 30-day mortality and readmission rates than do patients cared for by male internists. These findings suggest that the differences in practice patterns between male and female physicians, as suggested in previous studies, may have important clinical implications for patient outcomes.
Now those “suggests” in the second sentence should set alarm bells ringing. And, indeed, Tsugawa and his co-authors did not measure how doctors practiced, and so even if it were true that male and female physicians had different 30-day mortality and readmission rates, the researchers would have no way of knowing why the differences existed. And neither would NBC.
Let’s Examine the Numbers
What happened was this. The authors collected a sample of about a million-and-a-half “Medicare fee-for-service beneficiaries 65 years or older who were hospitalized in acute care hospitals.” Mean age of patients was about 80. The NBC summary misleads by saying just “patients,” which implies the research applies to everybody and not just elderly Medicare patients. . . .
Are there other possible explanations to account for the small differences noted by the models? Yes. Female docs were about 5 years younger on average, and female docs also treated many fewer patients on average than men. This implies women docs had more time per patient.
Even more intriguing, we also know “female physicians treated slightly higher proportions of female patients than male physicians did.” And since women live longer than men, particularly at those advanced ages, maybe — just maybe! — any slight change in mortality or readmission rates between male and female docs could be explained by women doctors treating more longer-lived patients.
That explanation is surely as or more plausible than results from an unnecessarily complicated statistical model. It also eliminates the unwarranted theorizing about how women physicians are “better at sticking to the evidence” and are thus “underpaid.”
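Briggs’s composition story is just a weighted average, and a back-of-the-envelope calculation (all numbers invented; only the mechanism is the point) shows that a modest difference in patient-sex mix can move crude mortality rates by a tenth or two of a percentage point:

```python
# Hypothetical 30-day mortality rates by patient sex (made-up numbers)
mort_female_pt = 0.10
mort_male_pt = 0.13

# Hypothetical share of female patients seen by each group of doctors
mix_female_doc = 0.55   # female doctors see slightly more female patients
mix_male_doc = 0.50

# Crude mortality rate for each doctor group is just the mix-weighted average
rate_female_doc = mix_female_doc * mort_female_pt + (1 - mix_female_doc) * mort_male_pt
rate_male_doc = mix_male_doc * mort_female_pt + (1 - mix_male_doc) * mort_male_pt

print(rate_female_doc)                    # approx. 0.1135
print(rate_male_doc)                      # approx. 0.115
print(rate_male_doc - rate_female_doc)    # approx. 0.0015: a 0.15-point gap from mix alone
```

The paper does adjust for patient sex, so this particular channel should be at least partly closed; the calculation only shows why small compositional differences deserve scrutiny when the headline gap is itself less than half a percentage point.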