Their signal-to-noise ratio was low, so they decided to do a specification search, use a one-tailed test, and go with a p-value of 0.1.

Adam Zelizer writes:

I saw your post about the underpowered COVID survey experiment on the blog and wondered if you’ve seen this paper, “Counter-stereotypical Messaging and Partisan Cues: Moving the Needle on Vaccines in a Polarized U.S.” It is written by a strong team of economists and political scientists and finds large positive effects of Trump pro-vaccine messaging on vaccine uptake.

They find large positive effects of the messaging (administered through YouTube ads) on the number of vaccines administered at the county level—over 100 new vaccinations in treated counties—but only after changing their specification from the prespecified one in the PAP. The p-value from the main modified specification is only 0.097, from a one-tailed test, and the effect size from the modified specification is 10 times larger than what they get from the prespecified model. The prespecified model finds that showing the Trump advertisement increased the number of vaccines administered in the average treated county by 10; the specification in the paper, and reported in the abstract, estimates 103 more vaccines. So moving from the specification in the PAP to the one in the paper doesn’t just improve precision, but it dramatically increases the estimated treatment effect. A good example of suppression effects.

They explain their logic for using the modified specification, but it smells like the garden of forking paths.

Here’s a snippet from the article:

I don’t have much to say about the forking paths except to give my usual advice to fit all reasonable specifications and use a hierarchical model, or at the very least do a multiverse analysis. No reason to think that the effect of this treatment should be zero, and if you really care about effect size you want to avoid obvious sources of bias such as model selection.
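Just to make the multiverse idea concrete, here is a minimal sketch in R. The data frame counties, the outcome new_vaccinations, the treated indicator, and the control variables are all hypothetical stand-ins, not the authors’ data or specifications; the point is simply to fit every reasonable specification and report the whole set of estimates rather than selecting one.

specs <- list(
  new_vaccinations ~ treated,                                       # prespecified model
  new_vaccinations ~ treated + baseline_vax_rate,
  new_vaccinations ~ treated + baseline_vax_rate + log(population),
  new_vaccinations ~ treated + baseline_vax_rate + log(population) + median_income
)

# Fit each specification and pull out the treatment estimate and its standard error
# (assumes treated is a 0/1 numeric indicator in the hypothetical data frame counties).
multiverse <- t(sapply(specs, function(f) {
  fit <- lm(f, data = counties)
  coef(summary(fit))["treated", c("Estimate", "Std. Error")]
}))
multiverse   # one row (estimate, SE) per specification: report them all, don't pick one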

The above bit about one-tailed tests reflects a common misunderstanding in social science. As I’ll keep saying until my lips bleed, effects are never zero. They’re large in some settings, small in others, sometimes positive, sometimes negative. From the perspective of the researchers, the idea of the hypothesis test is to give convincing evidence that the treatment truly has a positive average effect. That’s fine, and it’s addressed directly through estimation: the uncertainty interval gives you a sense of what the data can tell you here.

When they say they’re doing a one-tailed test and they’re cool with a p-value of 0.1 (equivalent to 0.2 under the standard two-sided approach) because they have “low signal-to-noise ratios” . . . that’s just wack. A low signal-to-noise ratio implies high uncertainty in your conclusions. High uncertainty is fine! You can still recommend the policy in the midst of this uncertainty. After all, policymakers have to do something. To me, this one-sided testing and p-value thresholding just seems to be missing the point, in that it’s trying to squeeze an expression of near-certainty out of data that don’t admit such an interpretation.

P.S. I do not write this sort of post out of animosity toward the authors or toward their topic of research. I write about these methods issues because I care. Policy is important. I don’t think it is good for policy for researchers to use statistical methods that lead to overconfidence and inappropriate impressions of certainty or near-certainty. The goal of a statistical analysis should not be to attain statistical significance or otherwise reach some sort of success point. It should be to learn what we can from our data and model, and also to get a sense of what we don’t know.

Putting a price on vaccine hesitancy (Bayesian analysis of a conjoint experiment)

Tom Vladeck writes:

I thought you may be interested in some internal research my company did using a conjoint experiment, with analysis using Stan! The upshot is that we found that vaccine hesitant people would require a large payment to take the vaccine, and that there was a substantial difference between the prices required for J&J and Moderna & Pfizer (evidence that the pause was very damaging). You can see the model code here.

My reply: Cool! I recommend you remove the blank lines from your Stan code as that will make your program easier to read.

Vladeck responded:

I prefer a lot of vertical white space. But good to know that I’m likely in the minority there.

For me, it’s all about the real estate. White space can help code be more readable but it should be used sparingly. What I’d really like is a code editor that does half white spaces.

Clinical trials that are designed to fail

Mark Palko points us to a recent update by Robert Yeh et al. of the famous randomized parachute-jumping trial:

Palko writes:

I also love the way they dot all the i’s and cross all the t’s. The whole thing is played absolutely straight.

I recently came across another (not meant as satire) study where the raw data was complete crap but the authors had this ridiculously detailed methods section, as if throwing in a graduate level stats course worth of terminology would somehow spin this shitty straw into gold.

Yeh et al. conclude:

This reminded me of my zombies paper. I forwarded the discussion to Kaiser Fung, who wrote:

Another recent example from Covid is this Scottish study. They did so much to the data that it is impossible for any reader to judge whether they did the right things or not. The data are all locked down for “privacy.”

Getting back to the original topic, Joseph Delaney had some thoughts:

I think the parachute study makes a good and widely misunderstood point. Our randomized controlled trial infrastructure is designed for the drug development world, where there is a huge (literally life altering) benefit to proving the efficacy of a new agent. Conservative errors are being cautious and nobody seriously considers a trial designed to fail as a plausible scenario.

But you see new issues with trials designed to find side effects (e.g., RECORD has a lot more LTFU (loss to follow-up) than I saw in a drug study; when I did trials we studied how to improve adherence to improve the results—but a trial looking for side effects that cost the company money would do the reverse). We teach in pharmacy that conservative design is actually a problem in safety trials.

Even worse are trials which are aliased with a political agenda. It’s easy-peasy to design a trial to fail (the parachute trial was jumping from a height of 2 feet). That makes me a lot more critical when you see trials where the failure of the trial would be seen as an upside, because it is just so easy to botch a trial. Designing good trials is very hard (smarter people than I spend entire careers doing a handful of them). It’s a tough issue.

Lots to chew on here.

“Participants reported being hungrier when they walked into the café (mean = 7.38, SD = 2.20) than when they walked out [mean = 1.53, SD = 2.70, F(1, 75) = 107.68, P < 0.001]."

Jonathan Falk came across this article and writes:

Is there any possible weaker conclusion than “providing caloric information may help some adults with food decisions”?

Is there any possible dataset which would contradict that conclusion?

On one hand, gotta give the authors credit for not hyping or overclaiming. On the other hand, yeah, the statement, “providing caloric information may help some adults with food decisions,” is so weak as to be essentially empty. I wonder whether part of the problem here is the convention that the abstract is supposed to conclude with some general statement, something more than just, “That’s what we found in our data.”

Still and all, this doesn’t reach the level of the classic “Participants reported being hungrier when they walked into the café (mean = 7.38, SD = 2.20) than when they walked out [mean = 1.53, SD = 2.70, F(1, 75) = 107.68, P < 0.001]."

Lancet-bashing!

Retraction Watch points to this fun article by Ashley Rindsberg, “The Lancet was made for political activism,” subtitled, “For 200 years, it has thrived on melodrama and scandal.”

And they didn’t even mention Surgisphere (for more detail, see here) or this story (the PACE study) or this one about gun control.

All journals publish bad papers; we notice Lancet’s more because they get more publicity.

Mister P and Stan go to Bangladesh . . .

Prabhat Barnwal, Yuling Yao, Yiqian Wang, Nishat Akter Juy, Shabib Raihan, Mohammad Ashraful Haque, and Alexander van Geen ask,

Is the low COVID-19–related mortality reported in Bangladesh for 2020 associated with massive undercounting?

Here’s what they did:

This repeated survey study is based on an in-person census followed by 2 rounds of telephone calls. Data were collected from a sample of 135 villages within a densely populated 350-km2 rural area of Bangladesh. Household data were obtained first in person and subsequently over the telephone. For the analysis, mortality data were stratified by month, age, sex, and household education. Mortality rates were modeled by bayesian multilevel regression, and the strata were aggregated to the population by poststratification. Data analysis was performed from February to April 2021. . . .

Mortality rates were compared for 2019 and 2020, both without adjustment and after adjustment for nonresponse and differences in demographic variables between surveys. Income and food availability reported for January, May, and November 2020 were also compared.

And here’s what they found:

All-cause mortality in the surveyed area was lower in 2020 compared with 2019, but measures to control the COVID-19 pandemic were associated with a reduction in rural income and food availability. These findings suggest that government restrictions designed to curb the spread of COVID-19 may have been effective in 2020 but needed to be accompanied by expanded welfare support.

More specifically:

Enumerators collected data from an initial 16 054 households in January 2020 . . . for a total of 58 806 individuals . . . A total of 276 deaths were reported between February and the end of October 2020 for the subset of the population that could be contacted twice over the telephone, slightly below the 289 deaths reported for the same population over the same period in 2019. After adjustment for survey nonresponse and poststratification, 2020 mortality changed by −8% (95% CI, −21% to 7%) compared with an annualized mortality of 6.1 deaths per 1000 individuals in 2019. However, in May 2020, salaried primary income earners reported a 40% decrease in monthly income (from 17 485 to 10 835 Bangladeshi Taka), and self-employed earners reported a 60% decrease in monthly income (23 083 to 8521 Bangladeshi Taka), with only a small recovery observed by November 2020.

I’ve worked with Lex and Yuling for a long time, and they both know what they’re doing.

Beyond the direct relevance of this work, the above-linked article is a great example of applied statistical analysis with multilevel regression and poststratification using Stan.
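For readers who want to see what the workflow looks like, here is a minimal MRP sketch in R using rstanarm. This is not the authors’ code: the data frames survey (one row per respondent, with a 0/1 outcome died and stratifying variables) and poststrat (one row per stratum with its census population count N) are hypothetical stand-ins, and the model is much simpler than the one in the paper.

library(rstanarm)

# Multilevel logistic regression for individual mortality, with varying intercepts
# for the (hypothetical) stratifying variables.
fit <- stan_glmer(
  died ~ sex + (1 | age_group) + (1 | education),
  family = binomial(link = "logit"),
  data = survey, refresh = 0
)

# Posterior expected mortality rate in each poststratification cell (draws x cells).
cell_rates <- posterior_epred(fit, newdata = poststrat)

# Poststratify: population-weighted average over cells, one value per posterior draw.
mrp_draws <- as.vector(cell_rates %*% poststrat$N) / sum(poststrat$N)
quantile(mrp_draws, c(0.025, 0.5, 0.975))   # population mortality estimate with uncertainty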

The appeal of New York Times columnist David Brooks . . . Yeah, I know this all sounds like a nutty “it’s wheels within wheels, man” sort of argument, but I’m serious here!

Over the years, we’ve written a bit about David Brooks on this blog, originally because he had interesting things to say about a topic I care about (Red State Blue State) and later because people pointed out to me various places where he made errors and then refused to correct them, something that bothered me for its own sake (correctable errors in the paper of record!) and as part of a larger phenomenon which I described as Never back down: The culture of poverty and the culture of journalism. At an intellectual level, I understand why pundits are motivated to never admit error, and I can also see how they get into the habit of shunting criticism aside because they get so much of it; still, I get annoyed.

Another question arises, though: how has Brooks kept his job for so long? I had a recent discussion with Palko on this point.

The direct answer to why Brooks stays employed is that he’s a good writer, regularly turns in his columns on time, continues to write on relevant topics, and often has interesting ideas. Sure, he makes occasional mistakes, but (a) everyone makes mistakes, and when they appear in a newspaper with a circulation of millions, people will catch these mistakes, and (b) newspapers in general, and the Times in particular, are notorious for only very rarely running corrections, so Brooks making big mistakes and not correcting himself is not any kind of disqualification.

In addition, Palko wrote:

For the target audience [of the Times, Brooks offers] a nearly ideal message. It perfectly balances liberal guilt with a sense of class superiority.

I replied with skepticism of Palko’s argument that Brooks’s continued employment comes from his appeal to liberals.

I suspect that more of it is the opposite, that Brooks is popular among conservatives because he’s a conservative who conservatives think can appeal to liberals.

Kinda like the appeal of Michael Moore to liberals: Moore’s the sort of liberal who liberals think can appeal to conservatives.

I like this particular analogy partly because I imagine that it would piss off both Brooks and Moore (not that either of them will ever see this post).

Palko responded:

But it’s not conservatives who keep hiring him.

Brooks’ breakthrough was in the Atlantic, the primary foundation of his career is his long-time day job with the NYT, and his largest audience probably comes from PBS News Hour.

To which I replied as follows:

First off, I don’t know whether the people who are hiring Brooks are liberal, conservative, or somewhere in between. In any case, if they’re conservative, I’m pretty sure they’re only moderately so: I say this because I don’t think the NYT op-ed page has any columnists who supported the Jan 6 insurrection or who claim that Trump actually won the 2020 election etc.

It’s my impression that one reason Brooks was hired, in addition to his ability to turn in readable columns on time, was (a) he’s had some good ideas that have received a lot of attention (for example, the whole bobo stuff, his red-state, blue-state stuff), and (b) most of their op-ed columnists have been liberal or centrist, and they want some conservatives for balance.

Regarding (a), yes, he’s said a lot of dumb things, but I’d say he still has had some good ideas. He’s kinda like Gladwell in that he speculates with an inappropriate air of authority, but his confidence can sometimes get him to interesting places that a more careful writer might never reach.

Regarding (b), it’s relevant that many conservatives are fans of Brooks (for example here, here, and here). If the NYT is going to hire a conservative writer for balance, they’ll want to hire a conservative writer who conservatives like. Were they to hire a writer who conservatives hate, they wouldn’t be doing a good job of satisfying their goal of balance.

So, whoever is in charge of hiring Brooks and wherever his largest audience is, I think that a key to his continued employment is that he is popular among conservatives because he’s a conservative who conservatives think can appeal to liberals.

Yeah, I know this all sounds like a nutty “it’s wheels within wheels, man” sort of argument, but I’m serious here!

This post is political science

The point of posting this is not to talk more about Brooks—if you’re interested in him, you can read his column every week—but rather to consider some of these indirect relationships here, the idea that a publication with liberal columnists will hire a conservative who is then chosen in large part because conservatives see him as the sort of conservative who will appeal to liberals. I don’t think this happens so much in the opposite direction, because if a publication has lots of conservative columnists, that’s probably because it’s an explicitly conservative publication so they wouldn’t want to employ any liberals at all. There must be some counterexamples to that, though.

And I do think there’s some political science content here, related to this discussion I wrote with Gross and Shalizi, but I’ve struggled with how to address the topic more systematically.

John Mandrola’s tips for assessing medical evidence

Ben Recht writes:

I’m a fan of physician-blogger John Mandrola. He had a nice response to your blog, using it as a jumping-off point for a short tutorial on his rather conservative approach to medical evidence assessment.

John is always even-tempered and constructive, and I thought you might enjoy this piece as an “extended blog comment.” I think he does a decent job answering the question at hand, and his approach to medical evidence appraisal is one I more or less endorse.

My post in question was called, How to digest research claims? (1) vitamin D and covid; (2) fish oil and cancer, and I concluded with this bit of helplessness: “I have no idea what to think about any of these papers. The medical literature is so huge that it often seems hopeless to interpret any single article or even subliterature. I don’t know what is currently considered the best way to summarize the state of medical knowledge on any given topic.”

In his response, “Simple Rules to Understand Medical Claims,” Mandrola offers some tips:

The most important priors when it comes to medical claims are simple: most things don’t work. Most simple answers are wrong. Humans are complex. Diseases are complex. Single causes of complex diseases like cancer should be approached with great skepticism.

One of the studies sent to Gelman was a small trial finding that Vitamin D effectively treated COVID-19. The single-center open-label study enrolled 76 patients in early 2020. Even if this were the only study available, the evidence is not strong enough to move our prior beliefs that most simple things (like a Vitamin D tablet) do not work.

The next step is a simple search—which reveals two large randomized controlled trials of Vitamin D treatment for COVID-19, one published in JAMA and the other in the BMJ. Both were null.

You can use the same strategy for evaluating the claim that fish oil supplementation leads to higher rates of prostate cancer.

Start with prior beliefs. How is it possible that one exposure increases the rate of a disease that mostly affects older men? Answer: it’s not very possible. . . .

Now consider the claims linked in Gelman’s email.

– Serum Phospholipid Fatty Acids and Prostate Cancer Risk: Results From the Prostate Cancer Prevention Trial

– Plasma Phospholipid Fatty Acids and Prostate Cancer Risk in the SELECT Trial

While both studies stemmed from randomized trials, neither was a primary analysis. These were association studies using data from the main trial, and therefore we should be cautious in making causal claims.

Now go to Google. This reveals two large randomized controlled trials of fish oil vs placebo therapy.

– The ASCEND trial of n-3 fatty acids in 15k patients with diabetes found “no significant between-group differences in the incidence of fatal or nonfatal cancer either overall or at any particular body site.” And I would add no difference in all-cause death.

– The VITAL trial included cancer as a primary endpoint. More than 25k patients were randomized. The conclusions: “Supplementation with n−3 fatty acids did not result in a lower incidence of major cardiovascular events or cancer than placebo.”

Mandrola concludes:

I am not arguing that every claim is simple. My case is that the evaluation process is slightly less daunting than Professor Gelman seems to infer.

Of course, medical science can be complicated. Content expertise can be important. . . .

But that does not mean we should take the attitude: “I have no idea what to think about these papers.”

I offer five basic rules of thumb that help in understanding medical claims:

1. Hold pessimistic priors

2. Be super-cautious about causal inferences from nonrandom observational comparisons

3. Look for big randomized controlled trials—and focus on their primary analyses

4. Know that stuff that really works is usually obvious (antibiotics for bacterial infection; AEDs to convert VF)

5. Respect uncertainty. Stay humble about most “positive” claims.

This all makes sense, as long as we recognize that randomized controlled trials are themselves nonrandom observational comparisons: the people in the study won’t in general be representative of the population of interest, and there are also issues such as dropout, selection bias, lack of realism of treatments, etc., which can be huge in medical trials. Experimentation is great; we just need to avoid the pitfalls of (a) idealizing studies that have randomization (we should avoid the fallacy of thinking a chain is as strong as its strongest link) and (b) disparaging observational data without assessing its quality.

For our discussion here, the most relevant bit of Mandrola’s advice was this from the comment thread:

Why are people going to a Political Scientist for medical advice? That is odd.

I hope Prof Gelman’s answer was based on a recognition that he doesn’t have the context and/or the historical background to properly interpret the studies.

The answer is: Yes, I do recognize my ignorance! Here’s what I wrote in the above-linked post:

I’m not saying that the answers to these medical questions are unknowable, or even that nobody knows the answers. I can well believe there are some people who have a clear sense of what’s going on here. I’m just saying that I have no idea what to think about these papers.

Mandrola’s advice given above seems reasonable to me. But it can be hard for me to apply in that he’s assuming a background medical knowledge that I don’t have. On the other hand, when it comes to social science, I know a lot. For example, when I saw that claim that women during a certain time of the month were 20 percentage points more likely to vote for Barack Obama, it was immediately clear this was ridiculous, because public opinion just doesn’t change that much. This had nothing to do with randomized trials or observational comparisons or anything like that; it was just too noisy of a study to learn anything.

Bayesians moving from defense to offense: “I really think it’s kind of irresponsible now not to use the information from all those thousands of medical trials that came before. Is that very radical?”

Erik van Zwet, Sander Greenland, Guido Imbens, Simon Schwab, Steve Goodman, and I write:

We have examined the primary efficacy results of 23,551 randomized clinical trials from the Cochrane Database of Systematic Reviews.

We estimate that the great majority of trials have much lower statistical power for actual effects than the 80 or 90% for the stated effect sizes. Consequently, “statistically significant” estimates tend to seriously overestimate actual treatment effects, “nonsignificant” results often correspond to important effects, and efforts to replicate often fail to achieve “significance” and may even appear to contradict initial results. To address these issues, we reinterpret the P value in terms of a reference population of studies that are, or could have been, in the Cochrane Database.

This leads to an empirical guide for the interpretation of an observed P value from a “typical” clinical trial in terms of the degree of overestimation of the reported effect, the probability of the effect’s sign being wrong, and the predictive power of the trial.

Such an interpretation provides additional insight about the effect under study and can guard medical researchers against naive interpretations of the P value and overoptimistic effect sizes. Because many research fields suffer from low power, our results are also relevant outside the medical domain.

Also this new paper from Zwet with Lu Tian and Rob Tibshirani:

Evaluating a shrinkage estimator for the treatment effect in clinical trials

The main objective of most clinical trials is to estimate the effect of some treatment compared to a control condition. We define the signal-to-noise ratio (SNR) as the ratio of the true treatment effect to the SE of its estimate. In a previous publication in this journal, we estimated the distribution of the SNR among the clinical trials in the Cochrane Database of Systematic Reviews (CDSR). We found that the SNR is often low, which implies that the power against the true effect is also low in many trials. Here we use the fact that the CDSR is a collection of meta-analyses to quantitatively assess the consequences. Among trials that have reached statistical significance we find considerable overoptimism of the usual unbiased estimator and under-coverage of the associated confidence interval. Previously, we have proposed a novel shrinkage estimator to address this “winner’s curse.” We compare the performance of our shrinkage estimator to the usual unbiased estimator in terms of the root mean squared error, the coverage and the bias of the magnitude. We find superior performance of the shrinkage estimator both conditionally and unconditionally on statistical significance.

Let me just repeat that last sentence:

We find superior performance of the shrinkage estimator both conditionally and unconditionally on statistical significance.

From a Bayesian standpoint, this is no surprise. Bayes is optimal if you average over the prior distribution and can be reasonable if averaging over something close to the prior. Especially reasonable in comparison to naive unregularized estimates (as here).

Erik summarizes:

We’ve determined how much we gain (on average over the Cochrane Database) by using our shrinkage estimator. It turns out to be about a factor 2 more efficient (in terms of the MSE) than the unbiased estimator. That’s roughly like doubling the sample size! We’re using similar methods as our forthcoming paper about meta-analysis with a single trial.
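The basic phenomenon is easy to see in a toy simulation. This is a sketch of the general idea, not Zwet et al.’s estimator: it assumes the distribution of true effects is normal and known exactly, whereas the real work in their papers is estimating that distribution from the Cochrane data.

set.seed(1)
n_trials <- 1e5
tau    <- 1                                 # sd of true effects across trials
se     <- 1                                 # standard error within each trial
theta  <- rnorm(n_trials, 0, tau)           # true effects
est    <- rnorm(n_trials, theta, se)        # usual unbiased estimates
shrunk <- est * tau^2 / (tau^2 + se^2)      # posterior mean (shrinkage estimate)

sig <- abs(est) > 1.96 * se                 # "statistically significant" trials

mean(abs(est[sig])) / mean(abs(theta[sig])) # winner's curse: significant estimates overshoot
sqrt(mean((est - theta)^2))                 # RMSE of the unbiased estimator, all trials
sqrt(mean((shrunk - theta)^2))              # RMSE of the shrinkage estimator: smaller
sqrt(mean((est[sig] - theta[sig])^2))       # RMSE conditional on significance
sqrt(mean((shrunk[sig] - theta[sig])^2))    # still smaller for the shrinkage estimator

Conditioning on significance is an event defined by the data, so the posterior mean keeps its advantage conditionally as well as unconditionally, which is the same qualitative pattern as in the abstract quoted above.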

People sometimes ask me how I’ve changed as a statistician over the years. One answer I’ve given is that I’ve gradually become more Bayesian. I started out as a skeptic, wary of using Bayesian methods at all; then in grad school I started using Bayesian statistics in applications and realized it could solve some problems for me; when writing BDA and ARM, I still had the Bayesian cringe, using flat priors as much as possible or not talking about priors at all; then, with Aleks, Sophia, and others, I moved toward weakly informative priors; eventually, under the influence of Erik and others, I’ve been trying to use direct prior information. At this point I’ve pretty much gone full Lindley.

Just as a comparison to where my colleagues and I are now, check out my response in 2008 to a question from Sanjay Kaul about how to specify a prior distribution for a clinical trial. I wrote:

I suppose the best prior distribution would be based on a multilevel model (whether implicit or explicit) based on other, similar experiments. A noninformative prior could be ok but I prefer something weakly informative to avoid your inferences being unduly affected by extremely unrealistic possibilities in the tail of the distribution.

Nothing wrong with this advice, exactly, but I was still leaning in the direction of noninformativeness in a way that I would not anymore. Sander Greenland replied at the time with a recommendation to use direct prior information. (And, just for fun, here’s a discussion from 2014 on a topic where Sander and I disagree.)

Erik concludes:

I really think it’s kind of irresponsible now not to use the information from all those thousands of medical trials that came before. Is that very radical?

That last question reminds me of our paper from 2008, Bayes: Radical, Liberal, or Conservative?

P.S. Also this:

You can click through to see the whole story.

P.P.S. More here on the “from defense to offense” thing.

COVID roughly twice as deadly in poorer countries

Hi, this is Lonni.

This is now perhaps old news (May 2022), but I was not a writer on this blog at the time the paper was released. My co-authors and I analysed serology and mortality data from 62 studies in 25 developing countries. In this paper, published in BMJ Global Health, we find that age-stratified infection fatality rates (IFRs) are about twice as high as the benchmark metaregression for high-income countries.

Because vaccination and “newer treatments” muddy the waters, we used studies from before those were commonplace in developing countries, and compared against our benchmark for high-income nations. We also corrected for death under-ascertainment, which was a very heterogeneous problem across the studies and locations.

One of the issues in developing countries was that seroprevalence was rather uniform across ages, a sign that these countries could not protect their elderly. In high-income countries, seroprevalence tends to be higher among the younger, less vulnerable population.

Since the publication, other studies have found similarly worrying results, highlighting the importance, in future pandemics, of distributing vaccines to everyone rather than reserving them first for high-income places.

“U.S. Watchdog Halts Studies at N.Y. Psychiatric Center After a Subject’s Suicide”

I don’t know anyone involved in this story and don’t really have anything to add. I just wanted to post on it because it sits at the intersection of science, statistics, and academia. The New York State Psychiatric Institute is involved in a lot of the funded biostatistics research at Columbia University. Ultimately we want to save lives and improve people’s health, but in the meantime we do the work and take the funding without always thinking too much about the people involved. I don’t have any specific study in mind here; I’m just thinking in general terms.

Using simulation from the null hypothesis to study statistical artifacts (ivermectin edition)

Robin Mills, Ana Carolina Peçanha Antonio, and Greg Tucker-Kellogg write:

Background

Two recent publications by Kerr et al. reported dramatic effects of prophylactic ivermectin use for both prevention of COVID-19 and reduction of COVID-19-related hospitalisation and mortality, including a dose-dependent effect of ivermectin prophylaxis. These papers have gained an unusually large public influence: they were incorporated into debates around COVID-19 policies and may have contributed to decreased trust in vaccine efficacy and public health authorities more broadly. . . .

Methods

Starting with initially identified sources of error, we conducted a revised statistical analysis of available data, including data made available with the original papers and public data from the Brazil Ministry of Health. We identified additional uncorrected sources of bias and errors from the original analysis, including incorrect subject exclusion and missing subjects, an enrolment time bias, and multiple sources of immortal time bias. . . .

Conclusions

The inference of ivermectin efficacy reported in both papers is unsupported, as the observed effects are entirely explained by untreated statistical artefacts and methodological errors.

I guess that at this point ivermectin is over (see also here and here); still, just as it’s always good to do good science, even if the results are not surprising, it’s also always good to do good science criticism. From a methods point of view, this new paper by Mills et al. has a pleasant discussion of the value of simulation from the null hypothesis as a way to learn the extent of statistical artifacts; see discussion on page 12 of the above-linked paper.
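To illustrate the general technique with a toy example of my own (invented numbers, not the authors’ simulation): generate data in which the treatment does nothing, apply the flawed analysis, and see how large an apparent effect the artifact alone can produce. Here is a sketch of an immortal-time-bias artifact in R:

set.seed(123)
n         <- 1e5
follow_up <- 60                                # days of follow-up
death_day <- rexp(n, rate = 1/200)             # survival times; no treatment effect anywhere
died      <- death_day <= follow_up

# Suppose "regular users" are defined by prescriptions filled over the first 30 days,
# so anyone who dies before day 30 can never be classified as treated.
treated <- runif(n) < 0.4 & death_day > 30

# The naive comparison shows a strong "protective effect" that is pure artifact.
risk_treated   <- mean(died[treated])
risk_untreated <- mean(died[!treated])
c(risk_treated, risk_untreated, risk_ratio = risk_treated / risk_untreated)

Repeating the intended analysis on many such null datasets tells you how much of an observed effect could be explained by the artifact alone.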

I say all this in general terms, as I have not read the article in detail. The authors thank me in their acknowledgments so I must have helped them out at some point, but it’s been a while and now I don’t remember what I actually did!

P.S. “Using simulation from the null hypothesis to study statistical artifacts” is another way of saying “hypothesis testing.”

“Guns, Race, and Stats: The Three Deadliest Weapons in America”

Geoff Holtzman writes:

In April 2021, The Guardian published an article titled “Gun Ownership among Black Americans is Up 58.2%.” In June 2022, Newsweek claimed that “Gun ownership rose by 58 percent in 2020 alone.” The Philadelphia Inquirer first reported on this story in August 2020, and covered it again as recently as March 2023 in a piece titled “The Growing Ranks of Gun Owners.” In between, more than two dozen major media outlets reported this same statistic. Despite inconsistencies in their reporting, all outlets (directly or indirectly) cite as their source a survey-based infographic conducted by a firearm industry trade association.

Last week, I shared my thoughts on the social, political, and ethical dimensions of these stories in an article published in The American Prospect. Here, I address whether and to what extent their key statistical claim is true. And an examination of the infographic—produced by the National Shooting Sports Foundation (NSSF)—reveals that it is not. Below, I describe six key facts about the infographic that undermine the media narrative. After removing all false, misleading, or meaningless words from the Guardian’s headline and Newsweek’s claim, the only words remaining are “Among” “Is,” “In,” and “By.”

(1) 58.2% only refers to the first six months of 2020

To understand demographic changes in firearms purchases or ownership in 2020, one needs to ascertain firearm sales or ownership demographics from before 2020 and after 2020. The best way to do this is with a longitudinal panel, which is how Pew found no change in Black gun ownership rates among Americans from 2017 (24%) to 2021 (24%). Longitudinal research in The Annals of Internal Medicine also found no change in gun ownership among Black Americans from 2019 (21%) through 2020/2021 (21%).

By contrast, the NSSF conducted a one-time survey of its own member retailers. In July 2020, the NSSF asked these retailers to compare demographics in the first six months of 2020 to demographics in the first six months of 2019. A full critique of this approach and its drawbacks would require a lengthy discussion of the scientific literature on recency bias, telescoping effects, and so on. To keep this brief, I’d just like to point out that by July 2020, many of us could barely remember what the world was like back in 2019.

Ironically, the media couldn’t even remember when the survey took place. In September 2020, NPR reported—correctly—that “according to AOL News,” the survey concerned “the first six months of 2020.”  But in October of 2020, CNN said it reflected gun sales “through September.” And by June 2021, CNN revised its timeline to be even less accurate, claiming the statistic was “gun buyers in 2020 compared to 2019.”

Strangely, it seems that AOL News may have been one of the few media outlets that actually looked at the infographic it reported. The timing of the survey, along with other critical but collectively forgotten information on its methods, is printed at the top of the infographic. The entire top quarter of the NSSF-produced image is devoted to these details: “FIREARM & AMMUNITION SALES DURING 1ST HALF OF 2020, Online Survey Fielded July 2020 to NSSF Members.”

But as I discuss in my article in The American Prospect, a survey about the first half of 2020 doesn’t really support a narrative about Black Americans’ response to “protests throughout the summer” of 2020 or to that November’s “contested election.” This is a great example of a formal fallacy (post hoc reasoning), memory bias (more than one may have been at work here), and motivated reasoning all rolled into one. To facilitate these cognitive errors, the phrase “in 2020” is used ambiguously in the stories, referring at times to the first six months of 2020 and at times to specific days or periods during the last seven months. This part of the headlines and stories is not false, but it does conflate two distinct time periods.

The results of the NSSF survey cannot possibly reflect the events of the Summer and Fall of 2020. Rather, the survey’s methods and materials were reimagined, glossed over, or ignored to serve news stories about those events.

(2) 58.2% describes only a tiny, esoteric fraction of Americans

To generalize about gun owner demographics in the U.S., one has to survey a representative, random sample of Americans. But the NSSF survey was not sent to a representative sample of Americans—it was only sent to NSSF members. Furthermore, it doesn’t appear to have been sent to a random sample of NSSF members—we have almost no information on how the sample of fewer than 200 participants was drawn from the NSSF’s membership of nearly 10,000. Most problematically—and bizarrely—the survey is supposed to tell us something about gun buyers, yet the NSSF chose to send the survey exclusively to its gun sellers.

The word “Americans” in these headlines is being used as shorthand for “gun store customers as remembered by American retailers up to 18 months later.” In my experience, literally no one assumes I mean the latter when I say the former. The latter is not representative of the former, so this part of the headlines and news stories is misleading.

(3) 58.2% refers to some abstract, reconstructed memory of Blackness

The NSSF doesn’t provide demographic information for the retailers it surveyed. Demographics can provide crucial descriptive information for interpreting and weighting data from any survey, but their omission is especially glaring for a survey that asked people to estimate demographics. But there’s a much bigger problem here.

We don’t have reliable information about the races of these retailers’ customers, which is what the word “Black” is supposed to refer to in news coverage of the survey. This is not an attack on firearms retailers; it is a well-established statistical tendency in third-party racial identification. As I’ve discussed in The American Journal of Bioethics, a comparison of CDC mortality data to Census records shows that funeral directors are not particularly accurate in reporting the race of one (perfectly still) person at a time. Since that’s a simpler task than searching one’s memory and making statistical comparisons of all customers from January through June of two different years, it’s safe to assume that the latter tends to produce even less accurate reports.

The word “Black” in these stories really means “undifferentiated masses of people from two non-consecutive six-month periods recalled as Black.” Again, the construct picked out by “Black” in the news coverage is a far cry from the construct actually measured by the survey.

(4) 58.2% appears to be about something other than guns

The infographic doesn’t provide the full wording of survey items, or even make clear how many items there were. Of the six figures on the infographic, two are about “sales of firearms,” two are about “sales of ammunition,” and one is about “overall demographic makeup of your customers.” But the sixth and final figure—the source of that famous 58.2%—does not appear to be about anything at all. In its entirety, that text on the infographic reads: “For any demographic that you had an increase, please specify the percent increase.”

Percent increase in what? Firearms sales? Ammunition sales? Firearms and/or ammunition sales? Overall customers? My best guess would be that the item asked about customers, since guns and ammo are not typically assigned a race. But the sixth figure is uninterpretable—and the 58.2% statistic meaningless—in the absence of answers.

(5) 58.2% is about something other than ownership

I would not guess that the 58.2% statistic was about ownership, unless this were a multiple choice test and I was asked to guess which answer was a trap.

The infographic might initially appear to be about ownership, especially to someone primed by the initial press release. It’s notoriously difficult for people to grasp distinctions like those between purchases by customers and ownership in a broader population. I happen to think that the heuristics, biases, and fallacies associated with that difficulty—reverse inference, base rate neglect, affirming the consequent, etc.—are fascinating, but I won’t dwell on them here. In the end, ammunition is not a gun, a behavior (purchasing) is not a state (ownership), and customers are none of the above.

To understand how these concepts differ, suppose that 80% of people who walk into a given gun store in a given year own a gun. The following year, the store could experience a 58% increase in customers, or a 58% increase in purchases, but not observe a 58% increase in ownership. Why? Because even the best salesperson can’t get 126% of customers to own guns. So the infographic neither states nor implies anything specific about changes in gun ownership.

(6) 58.2% was calculated deceptively

I can’t tell if the data were censored (e.g., by dropping some responses before analysis) or if the respondents were essentially censored (e.g., via survey skip logic), but 58.2% is the average guess only of retailers who reported an increase in Black customers. Retailers who reported no increase in Black customers were not counted toward the average. Consequently, the infographic can’t provide a sample size for this bar chart. Instead, it presents a range of sample sizes for individual bars: “n=19-104.”

Presenting means from four distinct, artificially constructed, partly overlapping samples as a single bar chart without specifying the size of any sample renders that 58.2% number uninterpretable. It is quite possible that only 19 of 104 retailers reported an increase in Black customers, and that all 104 reported an increase in White customers—for whom the infographic (but not the news) reported a 51.9% increase. Suppose 85 retailers did not report an increase in Black customers, and instead reported no change for that group (i.e., a change of 0%). Then if we actually calculated the average change in demographics reported by all survey respondents, we would find just a 10.6% increase in Black customers (19/104 x 58.2%), as compared to a 51.9% increase in white customers (104/104 x 51.9%).

A proper analysis of the full survey data could actually undermine the narrative of a surge in gun sales driven by Black Americans. In fact, a proper calculation may even have found a decrease, not an increase, for this group. The first two bar charts on the infographic report percentages of retailers who thought overall sales of firearms and of ammunition were “up,” “down,” or the “same.” We don’t know if the same response options were given for the demographic items, but if they were, a recount of all votes might have found a decrease in Black customers. We’ll never know.

The 58.2% number is meaningless without additional but unavailable information. Or, to use more technical language, it is a ceiling estimate, as opposed to a real number. In my less-technical write-up, I simply call it a fake number.

This is kind of in the style of our recent article in the Atlantic, The Statistics That Come Out of Nowhere, but with a lot more detail. Or, for a simpler example, a claim from a few years ago about political attitudes of the super-rich, which came from a purported survey about which no details were given. As with some of those other claims, the reported number of 58% was implausible on its face, but that didn’t stop media organizations from credulously repeating it.

On the plus side, a few years back a top journal (yeah, you guessed it, it was Lancet, that fount of politically-motivated headline-bait) published a ridiculous study on gun control and, to their credit, various experts expressed their immediate skepticism.

To their discredit, the news media reports on that 58% thing did not even bother running it by any experts, skeptical or otherwise. Here’s another example (from NBC), here’s another (from Axios), here’s CNN . . . you get the picture.

I guess this story is just too good to check, it fits into existing political narratives, etc.

Oooh, I’m not gonna touch that tar baby!

Someone pointed me to a controversial article written a couple years ago. The article remains controversial. I replied that it’s a topic that I’ve not followed any detail and I’ll just defer to the experts. My correspondent pointed to some serious flaws in the article and asked that I link to the article here on the blog. He wrote, “I was unable to find any peer responses to it. Perhaps the discussants on your site will have some insights.”

My reply is the title of this post.

P.S. Not enough information is given in this post to figure out what is the controversial article here, so please don’t post guesses in the comments! Thank you for understanding.

“You need 16 times the sample size to estimate an interaction than to estimate a main effect,” explained

This has come up before here, and it’s also in Section 16.4 of Regression and Other Stories (chapter 16: “Design and sample size decisions,” Section 16.4: “Interactions are harder to estimate than main effects”). But there was still some confusion about the point so I thought I’d try explaining it in a different way.

The basic reasoning

The “16” comes from the following four statements:

1. When estimating a main effect and an interaction from balanced data using simple averages (which is equivalent to least squares regression), the estimate of the interaction has twice the standard error of the estimate of a main effect.

2. It’s reasonable to suppose that an interaction will have half the magnitude of a main effect.

3. From 1 and 2 above, we can suppose that the true effect size divided by the standard error is one-quarter as large for the interaction as for the main effect.

4. To achieve any desired level of statistical power for the interaction, you will need 4^2 = 16 times the sample size that you would need to attain that level of power for the main effect.

Statements 3 and 4 are unobjectionable. They somewhat limit the implications of the “16” statement, which does not in general apply to Bayesian or regularized estimates, nor does it consider goals other than statistical power (equivalently, the goal of estimating an effect to a desired relative precision). I don’t consider these limitations a problem; rather, I interpret the “16” statement as relevant to that particular set of questions, in the way that the application of any mathematical statement is conditional on the relevance of the framework under which it can be proved.

Statements 1 and 2 are a bit more subtle. Statement 1 depends on what is considered a “main effect,” and statement 2 is very clearly an assumption regarding the applied context of the problem being studied.

First, statement 1. Here’s the math for why the estimate of the interaction has twice the standard error of the estimate of the main effect. The scenario is an experiment with N people, of which half get treatment 1 and half get treatment 0, so that the estimated main effect is ybar_1 – ybar_0, comparing average under treatment and control. We further suppose the population is equally divided between two sorts of people, a and b, and half the people in each group get each treatment. Then the estimated interaction is (ybar_1a – ybar_0a) – (ybar_1b – ybar_0b).

The estimate of the main effect, ybar_1 – ybar_0, has standard error sqrt(sigma^2/(N/2) + sigma^2/(N/2)) = 2*sigma/sqrt(N); for simplicity I’m assuming a constant variance within groups, which will typically be a good approximation for binary data, for example. The estimate of the interaction, (ybar_1a – ybar_0a) – (ybar_1b – ybar_0b), has standard error sqrt(sigma^2/(N/4) + sigma^2/(N/4) + sigma^2/(N/4) + sigma^2/(N/4)) = 4*sigma/sqrt(N). I’m assuming that the within-cell standard deviation does not change after we’ve divided the population into 4 cells rather than 2; this is not exactly correct—to the extent that the effects are nonzero, we should expect the within-cell standard deviations to get smaller as we subdivide—but it is common in applications for the within-cell standard deviation to be essentially unchanged after adding the interaction. This is equivalent to saying that you can add an important predictor without the R-squared going up much, and it’s the usual story in research areas such as psychology, public opinion, and medicine, where individual outcomes are highly variable and so we look for effects in averages.
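If you want to check that factor of 2 numerically, here is a quick simulation under the null with sigma = 1 and N = 400, so the formulas above give standard errors of 0.10 for the main effect and 0.20 for the interaction:

set.seed(1)
N <- 400
sims <- replicate(10000, {
  group <- rep(c("a", "b"), each = N/2)
  treat <- rep(0:1, times = N/2)                # balanced treatment within each group
  y <- rnorm(N, 0, 1)                           # sigma = 1, no true effects
  main  <- mean(y[treat == 1]) - mean(y[treat == 0])
  inter <- (mean(y[treat == 1 & group == "a"]) - mean(y[treat == 0 & group == "a"])) -
           (mean(y[treat == 1 & group == "b"]) - mean(y[treat == 0 & group == "b"]))
  c(main = main, inter = inter)
})
apply(sims, 1, sd)   # empirical SEs: about 0.10 and 0.20, i.e., 2/sqrt(N) and 4/sqrt(N)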

The biggest challenge with the reasoning in the above two paragraphs is not the bit about sigma being smaller when the cells are subdivided—this is typically a minor concern, and it’s easy enough to account for if necessary—, nor is it the definition of interaction. Rather, the challenge comes, perhaps surprisingly, from the definition of main effect.

Above I define the “main effect” as the average treatment effect in the population, which seems reasonable enough. There is an alternative, though. You could also define the main effect as the average treatment effect in the baseline category. In the notation above, the main effect would then be defined as ybar_1a – ybar_0a. In that case, the standard error of the estimated interaction is only sqrt(2) times the standard error of the estimated main effect, rather than twice as large.

Typically I’ll frame the main effect as the average effect in the population, but there are some settings where I’d frame it as the average effect in the baseline category. It depends on how you’re planning to extrapolate the inferences from your model. The important thing is to be clear in your definition.

Now on to statement 2. I’m supposing an interaction that is half the magnitude of the main effect. For example, if the main effect is 20 and the interaction is 10, that corresponds to an effect of 25 in group a and 15 in group b. To me, that’s a reasonable baseline: the treatment effect is not constant but it’s pretty stable, which is kinda what I think about when I hear “main effect.”

But there are other possibilities. Suppose that the effect is 30 in group a and 10 in group b, so the effect is consistently positive but now varies by a factor of 3 between the two groups. In this case, the main effect is 20 and the interaction is 20. The main effect and the interaction are of equal size, and so you only need 4 times the sample size to estimate the interaction as to estimate the main effect.

Or suppose the effect is 40 in group a and 0 in group b. Then the main effect is 20 and the interaction is 40, and in that case you need the same sample size to estimate the main effect as to estimate the interaction. This can happen! In such a scenario, I don’t know that I’d be particularly interested in the “main effect”—I think I’d frame the problem in terms of effect in group a and effect in group b, without any particular desire to average over them. It will depend on context.

Why this is important

Before going on, let me copy something from our earlier post explaining the importance of this result: From the statement of the problem, we’ve assumed the interaction is half the size of the main effect. If the main effect is 2.8 on some scale with a standard error of 1 (and thus can be estimated with 80% power; see for example page 295 of Regression and Other Stories, where we explain why, for 80% power, the true value of the parameter must be 2.8 standard errors away from the comparison point), and the interaction is 1.4 with a standard error of 2, then the z-score of the interaction has a mean of 0.7 and a sd of 1, and the probability of seeing a statistically significant effect difference is pnorm(0.7, 1.96, 1) = 0.10. That’s right: if you have 80% power to estimate the main effect, you have 10% power to estimate the interaction.
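As a quick check, here are those two power calculations in R:

1 - pnorm(1.96, mean = 2.8, sd = 1)   # power for the main effect: about 0.80
1 - pnorm(1.96, mean = 0.7, sd = 1)   # power for the interaction: about 0.10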

And 10% power is really bad. It’s worse than it looks. 10% power kinda looks like it might be OK; after all, it still represents a 10% chance of a win. But that’s not right at all: if you do get “statistical significance” in that case, your estimate is a huge overestimate:

> raw <- rnorm(1e6, .7, 1)    # a million simulated interaction z-scores, true mean 0.7
> significant <- raw > 1.96   # the roughly 10% that reach statistical significance
> mean(raw[significant])      # average z-score among the significant results
[1] 2.4

So, the 10% of results that do appear to be statistically significant give an estimate of 2.4, on average, which is more than 3 times the true effect of 0.7.

So, yeah, you don’t want to be doing studies with 10% power, which implies that when you’re estimating that interaction, you have to forget about statistical significance; you need to just accept the uncertainty.

Explaining using a 2 x 2 table

Now to return to the main-effects-and-interactions thing:

One way to look at all this is by framing the population as a 2 x 2 table, showing the averages among control and treated conditions within groups a and b:

           Control  Treated  
Group a:  
Group b:  

For example, here’s a case where the treatment has a main effect of 20 and an interaction of 10:

           Control  Treated  
Group a:     100      115
Group b:     150      175

In this case, there’s a big “group effect,” not necessarily causal (I had vaguely in mind a setting where “Group” is an observational factor and “Treatment” is an experimental factor), but still a “main effect” in the sense of a linear model. Here, the main effect of group is 55. For the issues we’re discussing here, the group effect doesn’t really matter, but we need to specify something here in order to fill in the table.

If you’d prefer, you can set up a “null” setting where the two groups are identical, on average, under the control condition:

           Control  Treated  
Group a:     100      115
Group b:     100      125

Again, each of the numbers in these tables represents the population average within the four cells, and “effects” and “interactions” correspond to various averages and differences of the four numbers. We’re further assuming a balanced design with equal sample sizes and equal variances within each cell.

What would it look like if the interaction were twice the size of the main effect, for example a main effect of 20 and an interaction of 40? Here’s one possibility of the averages within each cell:

           Control  Treated  
Group a:     100      100
Group b:     100      140

If that’s what the world is like, then indeed you need exactly the same sample size (that is, the total sample size in the four cells) to estimate the interaction as to estimate the main effect.

When using regression with interactions

To reproduce the above results using linear regression, you’ll want to code the Group and Treatment variables on a {-0.5, 0.5} scale. That is, Group = -0.5 for a and +0.5 for b, and Treatment = -0.5 for control and +0.5 for treatment. That way, the main effect of each variable corresponds to the other variable equaling zero (thus, the average of a balanced population), and the interaction corresponds to the difference of treatment effects, comparing the two groups.

Alternatively we could code each variable on a {-1, 1} scale, in which case the main effects are divided by 2 and the interaction is divided by 4, but the standard errors are also divided in the same way, so the z-scores don’t change, and you still need the same multiple of the sample size to estimate the interaction as to estimate the main effect.

Or we could code each variable as {0, 1}, in which case, as discussed above, the main effect for each predictor is then defined as the effect of that predictor when the other predictor equals 0.
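Here is a small simulated check of the {-0.5, 0.5} coding, using toy numbers matching the summary at the end of this post: a treatment effect of 15 in group a and 25 in group b, so the average effect is 20 and the interaction is 10. The regression recovers the average effect as the coefficient on the treatment variable, and the interaction’s standard error comes out twice as large.

set.seed(2)
N <- 4000
is_b     <- rep(0:1, each = N/2)                 # group indicator: 0 = a, 1 = b
is_treat <- rep(0:1, times = N/2)                # balanced treatment within each group
# True effects: 15 in group a, 25 in group b; within-cell sd of 10.
y <- 100 + 50*is_b + 15*is_treat + 10*is_b*is_treat + rnorm(N, 0, 10)

group <- is_b - 0.5                              # {-0.5, +0.5} coding
treat <- is_treat - 0.5
fit <- lm(y ~ group * treat)
round(coef(summary(fit))[, c("Estimate", "Std. Error")], 2)
# "treat" is about 20 (the population-average effect), "group:treat" about 10,
# and the interaction's standard error is twice the main effect's.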

Why do I make the default assumptions that I do in the above analyses?

The scenario I have in mind is studies in psychology or medicine where a and b are two groups of the population, for example women and men, or young and old people, and researchers start with a general idea, a “main effect,” but there is also interest in how this effect varies, that is, in “interactions.” In my scenario, neither a nor b is a baseline, and so it makes sense to think of the main effect as some sort of average (which, as discussed here, can take many forms).

In the world of junk science, interactions represent a way out, a set of forking paths that allow researchers to declare a win in settings where their main effect does not pan out. Three examples we’ve discussed to death in this space are the claim of an effect of fat arms on men’s political attitudes (after interacting with parental SES), an effect of monthly cycle on women’s political attitudes (after interacting with partnership status), and an effect of monthly cycle on women’s clothing choices (after interacting with weather). In all these examples, the main effect was the big story and the interaction was the escape valve. The point of “You need 16 times the sample size to estimate an interaction than to estimate a main effect” is not to say that researchers shouldn’t look for interactions or that they should assume interactions are zero; rather, the point is that they should not be looking for statistically-significant interactions, given that their studies are, at best, barely powered to estimate main effects. Thinking about interactions is all about uncertainty.

In more solid science, interactions also come up: there are good reasons to think that certain treatments will be more effective on some people and in some scenarios. Again, though, in a setting where you’re thinking of interactions as variations on a theme of the main effect, your inferences for interactions will be highly uncertain, and the “16” advice should be helpful both in design and analysis.

Summary

In a balanced experiment, when the treatment effect is 15 in Group a and 25 in Group b (that is, the main effect is twice the size of the interaction), the estimate of the interaction will have twice the standard error of the estimate of the main effect while being half its size, so its z-score will be one-quarter as large, and you’d need a sample size of 16*N to estimate the interaction at the same relative precision as you can estimate the main effect from the same design but with a sample size of N.

With other scenarios of effect sizes, the result is different. If the treatment effect is 10 in Group a and 30 in Group b, you’d need 4 times the sample size to estimate the interaction as to estimate the main effect. If the treatment effect is 0 in Group a and 40 in Group b, you’d need equal sample sizes.
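
Those three multipliers can be checked with a few lines of arithmetic, using the fact, from the balanced-design discussion above, that the interaction estimate has twice the standard error of the main-effect estimate, so the required sample size scales as the inverse square of the z-score ratio:

# Treatment effects in Group a and Group b for the three scenarios above:
for te_a, te_b in [(15, 25), (10, 30), (0, 40)]:
    main = (te_a + te_b) / 2   # main effect: average of the two treatment effects
    inter = te_b - te_a        # interaction: difference of the two treatment effects
    # se(interaction) = 2 * se(main effect) in a balanced design, so the required
    # sample-size multiplier is the squared ratio of (main / 1) to (inter / 2):
    multiplier = (2 * main / inter) ** 2
    print(f"effects ({te_a}, {te_b}): need {multiplier:.0f} times the sample size")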

On a really bad paper on birth month and autism (and why there’s value in taking a look at a clear case of bad research, even if it’s obscure and from many years ago)

In an otherwise unrelated thread on Brutus vs. Mo Willems, an anonymous commenter wrote:

Researchers found that the risk of autism in twins depended on the month they were born in, with January being 80% riskier than December.

The link is from a 2005 article in the fun magazine New Scientist, “Autism: Lots of clues, but still no answers,” which begins:

The risk of autism in twins appears to be related to the month they are born in. The chance of both babies having the disorder is 80 per cent higher for January births than December births.

This was one of the many findings presented at the conference in Boston last week. It typifies the problems with many autism studies: the numbers are too small to be definitive – this one was based on just 161 multiple-birth babies – and even if the finding does stand up, it raises many more questions than it answers.

The article has an excellently skeptical title and lead-off, so I was curious what’s up with the author, Celeste Biever. A quick search shows that she’s currently Chief News and Features editor at Nature, so still in the science writing biz. That’s good!

The above link doesn’t give the full article but I was able to read the whole thing through the Columbia University library. The relevant part is that one of the authors of the birth-month study was Craig Newschaffer of the Johns Hopkins School of Public Health. I searched for *Craig Newschaffer autism birth month* on Google Scholar and found an article, “Variation in season of birth in singleton and multiple births concordant for autism spectrum disorders,” by L. C. Lee, C. J. Newschaffer, et al., published in 2008 in Paediatric and Perinatal Epidemiology.

I suppose that, between predatory journals and auto-writing tools such as Galactica, the scientific literature will be a complete mess in a few years, but for now we can still find papers from 2008 and be assured that they’re the real thing.

The searchable online version only gave the abstract and references, but again I could find the full article through the Columbia library. And I can report to you that the claim that the “chance of both babies having the disorder is 80 per cent higher for January births than December births” is not supported by the data.

Let’s take a look. From the abstract:

This study aimed to determine whether the birth date distribution for individuals with autism spectrum disorders (ASD), including singletons and multiple births, differed from the general population. Two ASD case groups were studied: 907 singletons and 161 multiple births concordant for ASD.

161 multiple births . . . that’s about 13 per month, sounds basically impossible for there to be any real evidence of different frequencies comparing December to January. But let’s see what the data say.
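
Before looking, here’s a back-of-the-envelope sketch of how noisy those monthly counts have to be; the rough Poisson approximation is my assumption, purely for illustration, not anything the paper uses:

import numpy as np

# 161 concordant multiple births spread over 12 months:
per_month = 161 / 12                    # about 13.4 expected births per month
sd_count = np.sqrt(per_month)           # rough Poisson sd, about 3.7 births
sd_log_ratio = np.sqrt(2 / per_month)   # approximate sd of log(count_Jan / count_Dec)
print(f"expected per month: {per_month:.1f}, sd of a monthly count: {sd_count:.1f}")
print(f"sd of the log of a January/December ratio: {sd_log_ratio:.2f}")
# log(1.8) is about 0.59, roughly 1.5 of these sd's, so an "80 per cent higher"
# January is entirely consistent with noise, even before accounting for the fact
# that the January/December comparison was presumably chosen after looking at
# all 12 months.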

From the article:

Although a pattern of birth seasonality in autism was first reported in the early 1980s, the findings have been inconsistent. The first study to examine autism births by month was conducted by Bartlik more than two decades ago. That study compared the birth month of 810 children diagnosed with autism with general births and reported that autism births were higher than expected in March and August; the effect was more pronounced in more severe cases. A later report analysed data from the Israeli national autism registry which had information on 188 individuals diagnosed with autistic disorder. It, too, demonstrated excess births in March and August. Some studies, however, found excess autism births in March only.

March and August, huh? Sounds like noise mining to me.

Anyway, that’s just the literature. Now on to the data. First they show cases by day:

Ok, that was silly, no real reason to have displayed it at all. Then they have graphs by month. They use some sort of smoothing technique called Empiric Mode Decomposition, whatever. Anyway, here’s what they’ve got, first for autistic singleton births and then for autistic twins:

Looks completely random to me. The article states:

In contrast to the trend of the singleton controls, which were relatively flat throughout the year, increases in the spring (April), the summer (late July) and the autumn (October) were found in the singleton ASD births (Fig. 2). Trends were also observed in the ASD concordant multiple births with peaks in the spring (March), early summer (June) and autumn (October). These trends were not seen in the multiple birth controls. Both ASD case distributions in Figs. 2 and 3 indicated a ‘valley’ during December and January. Results of the non-parametric time-series analyses suggested there were multiple peaks and troughs whose borders were not clearly bound by month.

C’mon. Are you kidding me??? Then this:

Caution should be used in interpreting the trend for multiple concordant births in these analyses because of the sparse available data.

Ya think?

Why don’t they cut out the middleman and just write up a bunch of die rolls.

Then this:

Figures 4 and 5 present relative risk estimates from Poisson regression after adjusting for cohort effects. Relative risk for multiple ASD concordant males was 87% less in December than in January with 95% CIs from 2% to 100%. In addition, excess ASD concordant multiple male births were indicated in March, May and September, although they were borderline for statistical significance.

Here are the actual graphs:

No shocker that if you look at 48 different comparisons, you’ll find something somewhere that’s statistically significant at the 5% level and a couple more items that are “borderline for statistical significance.”
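
To put a rough number on that, here’s a quick sketch treating the 48 comparisons as independent pure noise (they aren’t exactly independent, but it gives the right order of magnitude):

import numpy as np

# If the 48 comparisons were independent pure noise, the chance of at least one
# "significant at the 5% level" result would already be about 1 - 0.95^48:
print(f"P(at least one of 48 hits p < 0.05 under pure noise): {1 - 0.95 ** 48:.2f}")

# A quick simulation version of the same point:
rng = np.random.default_rng(1)
z = rng.normal(size=(10_000, 48))      # 10,000 fake studies of 48 noise comparisons
hits = (np.abs(z) > 1.96).any(axis=1)  # any comparison "significant" at the 5% level?
print(f"simulated share of pure-noise studies with at least one hit: {hits.mean():.2f}")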

This is one of those studies that (a) shows nothing, and (b) never had a chance. Unfortunately, statistics education and practice are focused on data analysis and statistical significance, not so much on design. This is just a ridiculously extreme case of noise mining.

In addition, I came across an article, The Epidemiology of Autism Spectrum Disorders, by Newschaffer et al., published in the Annual Review of Public Health in 2007, that doesn’t mention birth month at all. So, somewhere between 2005 and 2007, it seems that Newschaffer decided that whatever birth-month effects were out there weren’t important enough to include in a 20-page review article. Then a year later they published a paper with all sorts of bold claims. Does not make a lot of sense to me.

Shooting a rabbit with a cannon?

Ok, this is getting ridiculous, you might say. Here we are picking to death an obscure paper from 15 years ago, an article we only heard about because it was indirectly referred to in a news article from 2005 that someone mentioned in a blog comment.

Is this the scientific equivalent of searching for offensive quotes on twitter and then getting offended? Am I just being mean by going through the flaws of this paper from the archives?

I don’t think so. I think there’s a value to this post, and I say it for two reasons.

1. Autism is important! There’s a reason why the government funds a lot of research on the topic. From the above-linked paper:

The authors gratefully acknowledge the following people and institutions for their resources and support on this manuscript:
1 The Autism Genetic Resource Exchange (AGRE) Consortium. AGRE is a programme of Cure Autism Now and is supported, in part, by Grant MH64547 from the National Institute of Mental Health to Daniel H. Geschwind.
2 Robert Hayman, PhD and Isabelle Horon, DrPH at the Maryland Department of Health and Mental Hygiene Vital Statistics Administration for making Maryland State aggregated birth data available for this analysis.
3 Rebecca A. Harrington, MPH, for editorial and graphic support.
Drs Lee and Newschaffer were supported by Centers for Disease Control and Prevention cooperative agreement U10/CCU320408-05, and Dr. Zimmerman and Ms. Shah were supported by Cure Autism Now and by Dr Barry and Mrs Renee Gordon. A preliminary version of this report was presented in part at the International Meeting for Autism Research, Boston, MA, May 2005.

This brings us to two points:

1a. All this tax money spent on a hopeless study of monthly variation in a tiny dataset is money that wasn’t spent on more serious research into autism or for that matter on direct services of some sort. Again, the problem with this study is not just that the data are indistinguishable from pure noise. The problem is that, even before starting the study, a competent analysis would’ve found that there was not enough data here to learn anything useful.

1b. Setting funding aside, attention given to this sort of study (for example, in that 2005 meeting and in the New Scientist article) is attention not being given to more serious research on the topic. To the extent that we are concerned about autism, we should be concerned about this diversion of attentional resources. At best, other researchers will just ignore this sort of pure-noise study; at worst, other researchers will take it seriously and waste more resources following it up in various ways.

Now, let me clarify that I’m not saying the authors who did this paper are bad people or that they were intending to waste government money and researchers’ attention. I can only assume they were 100% sincere and just working in a noise-mining statistical paradigm. This was 2005, remember, before “p-hacking,” “researcher degrees of freedom,” and “garden of forking paths” became commonly understood concepts in the scientific community. They didn’t know any better! They were just doing what they were trained to do: gather data, make comparisons, highlight “statistical significance” and “borderline statistical significance,” and tell stories. That’s what quantitative research was!

And that brings us to our final point:

2. That noise-mining paradigm is still what a lot of science and social science looks like. See here, for example. We’re talking about sincere, well-meaning researchers, plugged into the scientific literature and, unfortunately, pulling patterns out of what is essentially pure noise. Some of this work gets published in top journals, some of it gets adoring press treatment, some of it wins academic awards. We’re still there!

For that reason, I think there’s value in taking a look at a clear case of bad research. Not everything’s a judgment call. Some analyses are clearly valueless. Another example is the series of papers by that sex-ratio researcher, all of which are a mixture of speculative theory and pure noise mining, and all of which would be stronger without the distraction of data. Again, they’d be better off just reporting some die rolls; at least then the lack of relevant information content would be clearer.

P.S. One more time: I’m not saying the authors of these papers are bad people. They were just doing what they were trained to do. It’s our job as statistics teachers to change that training; it’s also the job of the scientific community not to reward noise-mining—even inadvertent noise-mining—as a career track.

Postdoc on Bayesian methodological and applied work! To optimize patient care! Using Stan! In North Carolina!

Sam Berchuck writes:

I wanted to bring your attention to a postdoc opportunity in my group at Duke University in the Department of Biostatistics & Bioinformatics. The full job ad is here: https://forms.stat.ufl.edu/statistics-jobs/entry/10978/.

The postdoc will work on Bayesian methodological and applied work, with a focus on modeling complex longitudinal biomedical data (including electronic health records and mobile health data) to create data-driven approaches to optimize patient care among patients with chronic diseases. The position will be particularly interesting to people interested in applying Bayesian statistics in real-world big data settings. We are looking for people who have experience in Bayesian inference techniques, including Stan!

Interesting. In addition to the Stan thing, I’m interested in data-driven approaches to optimize patient care. This is an area where a Bayesian approach, or something like it, is absolutely necessary, as you typically just won’t have enough data to make firm conclusions about individual effects, so you have to keep track of uncertainty. Sounds like a wonderful opportunity.

Wow—those are some really bad referee reports!

Dale Lehman writes:

I missed this recent retraction but the whole episode looks worth your attention. First the story about the retraction.

Here are the referee reports and authors responses.

And, here is the author’s correspondence with the editors about retraction.

The subject of COVID vaccine safety (or lack thereof) is certainly important and intensely controversial. The study has some fairly remarkable claims (deaths due to the vaccines numbering in the hundreds of thousands). The peer reviews seem to be an exemplary case of your statement that “the problems with peer review are the peer reviewers.” The data and methodology used in the study seem highly suspect to me – but the author appears to respond to many challenges thoughtfully (even if I am not convinced) and raises questions about the editorial practices involved with the retraction.

Here are some more details on that retracted paper.

Note the ethics statement about no conflicts – doesn’t mention any of the people supposedly behind the Dynata organization. Also, I was surprised to find the paper and all documentation still available despite being retracted. It includes the survey instrument. From what I’ve seen, the worst aspect of this study is that it asked people if they knew people who had problems after receiving the vaccine – no causative link even being asked for. That seems like an unacceptable method for trying to infer deaths from the vaccine – and one that the referees should never have permitted.

The most amazing thing about all this was the review reports. From the second link above, we see that the article had two review reports. Here they are, in their entirety:

The first report is an absolute joke, so let’s just look at the second review. The author revised in response to that review by rewriting some things, then the paper was published. At no time were any substantive questions raised.

I also noticed this from the above-linked news article:

“The study found that those who knew someone who’d had a health problem from Covid were more likely to be vaccinated, while those who knew someone who’d experienced a health problem after being vaccinated were less likely to be vaccinated themselves.”

Here’s a more accurate way to write it:

“The study found that those who SAID THEY knew someone who’d had a health problem from Covid were more likely to SAY THEY WERE vaccinated, while those who SAID THEY knew someone who’d experienced a health problem after being vaccinated were less likely to SAY THEY WERE vaccinated themselves.”

Yes, this sort of thing arises with all survey responses, but I think the subjectivity of the response is much more of a concern here than in a simple opinion poll.

The news article, by Stephanie Lee, makes the substantive point clearly enough:

This methodology for calculating vaccine-induced deaths was rife with problems, observers noted, chiefly that Skidmore did not try to verify whether anyone counted in the death toll actually had been vaccinated, had died, or had died because of the vaccine.

Also this:

Steve Kirsch, a veteran tech entrepreneur who founded an anti-vaccine group, pointed out that the study had the ivory tower’s stamp of approval: It had been published in a peer-reviewed scientific journal and written by a professor at Michigan State University. . . .

In a sympathetic interview with Skidmore, Kirsch noted that the study had been peer-reviewed. “The journal picks the peer reviewers … so how can they complain?” he said.

Ultimately the responsibility for publishing a misleading article falls upon the article’s authors, not upon the journal. You can’t expect or demand careful reviews from volunteer reviewers, nor can you expect volunteer journal editors to carefully vet every paper they will publish. Yes, the peer reviews for the above-discussed paper were useless—actually worse than useless, in that they gave a stamp of approval to bad work—but you can’t really criticize the reviewers for “not doing their jobs,” given that reviewing is not their job—they’re doing it for free.

Anyway, it’s a good thing that the journal shared the review reports so we can see how useless they were.

“Wait, is everybody wearing glasses nowadays?”

Paul Alper points to this fun news article by Andrew Van Dam, who runs the “Department of Data” column for the Washington Post. Van Dam writes:

According to our analysis of more than 110,000 responses to the National Health Interview Survey conducted by the Census Bureau on behalf of the National Center for Health Statistics, 62 percent of respondents said they donned some form of corrective eyewear in a recent three-year period. . . .

The ubiquity of eyeglasses in your personal universe will change depending on whether you’re hanging out with young legal workers (ages 25 to 39) or your friends who work in agriculture or construction. That’s because the legal workers are more than twice as likely to wear glasses.

What’s actually going on here? If good vision is hereditary, as we assume, how could your occupation determine your need for vision correction? . . . we called eye-data expert Bonnielin Swenor, director of the Johns Hopkins Disability Health Research Center. Swenor pointed us to her friend and colleague, Johns Hopkins Wilmer Eye Institute pediatric ophthalmologist and researcher Megan Collins, who appears to know everything about eyeballs. . . .

It turns out that, yes, myopia [nearsightedness] is on the march. In a 2009 JAMA Ophthalmology publication, National Institutes of Health ophthalmologists found that the prevalence of myopia had increased from 25 percent of the population age 12 to 54 in 1971 and 1972 to 42 percent of people in that age range in 1999 to 2004. The study was based on thousands of physical exams conducted for the National Health and Nutrition Examination Survey. . . .

Swenor and Collins explain that while kids may not have changed, the world around them sure has. And key changes in the way kids grow up — many associated with urban living and consumer technology — have been hard on the eyes. . . . According to a review of myopia research, spending time outdoors is one of the best things a kid can do for healthy eye growth. . . . Outdoor light may help your eyes grow, and being outside gives your eyeballs more opportunities to flex their muscles by focusing on distant objects. While data is surprisingly scarce, available evidence suggests kids may spend less time outdoors than they did a generation or two ago. . . .

Myopia has risen even more rapidly in East Asia, where countries have attempted sweeping remedies. A program in Taiwan, for example, encouraged students to participate in two hours of outdoor activity every day. After it began in 2010, researchers found in the journal Ophthalmology, Taiwan’s long rise in myopia went into reverse.

People also are more educated today, and many studies find that the more education you have, the more likely you are to be myopic. That correlation, of course, is probably related to the first two factors: To get a diploma or degree, you’ll probably spend more time indoors studying. . . . Education gaps often accompany much of the difference in myopia — and thus the glasses gap — among groups: Women are more likely to wear glasses than men. High earners are more likely to wear glasses than low earners. And Asian and White Americans are more likely to wear glasses than their Black and Hispanic compatriots.

Of course, myopia is not the only reason a more-educated person might be more likely to wear glasses. “There are a number of other factors that may be at play too,” Collins said, “including cost of eyeglasses, access to vision care, health literacy, or trust in the health-care system.” More educated Americans are also more likely to be doing jobs that require near work, such as typing or reading, and thus more likely to don reading glasses to compensate for the slow advance of presbyopia.

While myopia is an easily corrected annoyance for many of us, Swenor says its rising prevalence is also a bona fide public health issue. When the eyeball elongates, the stretching can damage the wall of your retina and cause permanent, non-correctible vision loss such as myopic macular degeneration. . . .

This is just great, an exemplar of newspaper science writing. Let me count the ways:

1. Lots of data graphics

2. Quotes with outside experts

3. This: “While data is surprisingly scarce, available evidence suggests . . .” I looooove this recognition of uncertainty.

This was so great that I added Van Dam’s column to our Blogs We Read page.

(again) why I don’t like to talk so much about “p-hacking.” But sometimes the term is appropriate!

Part 1

Jonathan Falk points us to this parody article that has suggestions on how to p-hack.

I replied that I continue to be bothered by the term “p-hacking.” Sometimes it applies very clearly (as in the work of Brian Wansink, although it’s a mystery why he felt the need to p-hack given that it seems that his data could never have existed as reported), but other times there is no “hacking” going on. So I prefer the term forking paths.

Two things going on here:

1. Saying “p-hacking” when it’s forking paths is uncharitable, as it implies active “hacking” when it can well be that researchers are just following the data in what seems like a reasonable way.

2. Bad researchers looove to conflate the professional and the personal. Say they’re p-hacking and they’ll get in a huff: “Who are you to accuse me of misconduct??”, etc. Say they have forking paths and you remove, or at least reduce, that argument. OK, in real life, yeah, people will say, “Who are you to accuse me of forking paths?”, but forking paths is just a thing that happens, an inevitable result of data processing and analysis plans that were not decided ahead of time.

So, yeah, humor aside, I don’t like the p-hacking talk, for similar reasons to my not liking the “file drawer” thing: in both cases, the focus on a specific mechanism can serve to minimize the real problem, to conflate scientific mistakes with intentional misconduct, and to provide an easy out for many practitioners of bad science who don’t seem to realize that honesty and transparency are not enuf.

Falk responds:

I agree completely with that.

But honestly, I feel like both the garden of forking paths and p-hacking are just versions of Bitcoin’s Proof of Work method. You get rewards for showing how much effort you had to go to get SIGNIFICANCE. If you have a study with a p-value on your first try of 1e-8, people will say “But that result was obvious! Why do they even bother with a test?” If you garden-of-forking-paths or p-hack your way to 0.047, you will be credited for your perspicacity.

Part 2

Ethan Steinberg writes:

I just came across an article that will probably be interesting to you and your readers. Back in 2022, the Florida Surgeon General released a report claiming that the COVID vaccine appeared to be statistically significantly correlated with cardiac arrest: “In the 28 days following vaccination, a statistically significant increase in cardiac-related deaths was detected for the entire study population (RI = 1.07, 95% CI = 1.03 – 1.12).” Here is the full report.

This was then used to recommend against COVID vaccines for young men in particular.

A local Florida paper just obtained and released the original versions of the reports:

Here are the drafts, from first to last.

The TLDR is that the original analysis did not find significant increases in cardiac related deaths. They had to go through a lot of analysis variants / drafts to get the result they were looking for.

I guess the real question here is how this could be avoided in the future. Maybe we should expect public health officials to register their analysis in advance?

I don’t think we should ask public health officials to register their analysis in advance, as that just seems like more of a mess. But in any case the above seems like an example where there really was p-hacking.

P.S. Just to clarify: As always, the problem is not with the “hacking”—looking at data in many different ways—but rather with reporting only a small subset of the analyses. It’s fine to go through a lot of analyses of the data; then you should publish all of them, or publish a single analysis that incorporates everything you’ve done using multilevel modeling.