We want to go beyond intent-to-treat analysis here, but we can’t. Why? Because of this: “Will the data collected for your study be made available to others?” “No”; “Would you like to offer context for your decision?” “–“. Millions of taxpayer dollars spent, and we don’t get to see the data.

Dale Lehman writes:

Let me be the first (or not) to ask you to blog about this just released NEJM study. Here are the study, supplementary appendix, and data sharing statement, and I’ve also included the editorial statement. The study is receiving wide media attention and is the continuation of a long-term trial that was reported on at a 10 year median follow-up. The current publication is for a 15 year median follow-up.

The overall picture is consistent with many other studies – prostate cancer is generally slow to develop and kills very few men. Intervention can have serious side effects and there is little evidence that it improves long-term survival, except (perhaps) in particular subgroups. Treatment and diagnosis have undergone considerable change in the past decade. The issue is of considerable interest to me – for statistical reasons as well as personal ones (since I have a prostate cancer diagnosis). Here are my concerns in brief:

This study once again brings up the issue of intention-to-treat vs actual treatment. The groups were randomized between active management (545 men), prostatectomy (533 men), and radiotherapy (545 men). The analysis was based on these groups, with deaths in the 3 groups of 17, 12, and 16 respectively. Figure 1 in the paper reveals that within the first year, 628 men were actually in the active surveillance group, and 488 in each of the other 2 groups: this is not surprising since many people resist the invasive treatment and possible side effects. I would consider these first-year group sizes, after people switched away from their random assignments, to be the true effective group sizes. However, the paper does not provide data on the actual deaths for the people that switched between the random assignment and actual treatment within the first year. So, it is not possible to determine the actual death rates in the 3 groups.

The paper reports death rates of 3.1%, 2.2%, and 2.9% in the 3 groups. If we just change the denominators to the actual size of the 3 groups in the first year, the 3 death rates are 2.7%, 2.5%, and 3.3%, making intervention look even worse. If we assume that half of the deaths in the randomized prostatectomy and radiotherapy groups were among those that refused the initial treatment and opted for active surveillance, then the 3 death rates would be 4.9%, 1.2%, and 1.6% respectively, making active surveillance look rather risky. Of course, I think allocating half of the deaths in those groups in this manner is a fairly extreme assumption. Given the small numbers of deaths involved, the deviations from random assignment to actual treatment could matter.
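A few lines of code reproduce this arithmetic (a minimal sketch; the 50/50 reallocation of deaths is the hypothetical scenario described above, not data from the paper):

```python
# Reproduce the back-of-the-envelope death rates under three accounting choices.
deaths = {"active_monitoring": 17, "prostatectomy": 12, "radiotherapy": 16}
randomized_n = {"active_monitoring": 545, "prostatectomy": 533, "radiotherapy": 545}
actual_n = {"active_monitoring": 628, "prostatectomy": 488, "radiotherapy": 488}

def rates(d, n):
    return {g: round(100 * d[g] / n[g], 1) for g in d}

print("intention-to-treat:", rates(deaths, randomized_n))   # ~3.1%, 2.2%, 2.9% (as reported, up to rounding)
print("ITT deaths / actual first-year group sizes:", rates(deaths, actual_n))   # ~2.7%, 2.5%, 3.3%

# Hypothetical: half of the deaths in the two treatment arms occurred among men
# who refused the assigned treatment and chose active surveillance instead.
moved = deaths["prostatectomy"] // 2 + deaths["radiotherapy"] // 2
realloc = {
    "active_monitoring": deaths["active_monitoring"] + moved,
    "prostatectomy": deaths["prostatectomy"] // 2,
    "radiotherapy": deaths["radiotherapy"] // 2,
}
print("hypothetical as-treated:", rates(realloc, actual_n))  # ~4.9%, 1.2%, 1.6%
```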

The authors have the data to conduct both an intention-to-treat comparison and an actual-treatment-received comparison, but did not report the latter (and did not indicate that they did such an analysis). If they had reported details on the 45 total deaths, I could do that analysis myself, but they don’t provide that data. In fact, the data sharing statement (attached) is quite remarkable – will the data be provided? “No.” That really irks me. I don’t see that there is really any concern about privacy. Withholding the data serves to bolster the careers of the researchers and the prestige of the journal, but it doesn’t have to be that way. If the journal released the data publicly and it was carefully documented, both the authors and the journal could receive widespread recognition for their work. Instead, they (and much of the establishment) choose to rely on their analysis to bolster their reputations. But these days the analysis is the easy part; it is the data curation and quality that is hard. Once again, the incentives and rewards are at odds with what makes sense.

Another question that is not analyzed, but could be if the data were provided, is whether the time of randomization matters. The article (and the editorial) cites improved monitoring, as MRI images are increasingly used along with biopsies. Given this evolution, the relative performance of the 3 groups might be changing over time – but no analysis is provided based on the year in which a person entered the study.

One other thing that you’ve blogged about often. For me, the most interesting figure is Figure S1, which actually shows the 45 deaths for the 3 groups. Looking at it, I see a tendency for the deaths to occur earlier with active surveillance than with either surgery or radiation. Of course, the p values suggest that this might just be random noise. Indeed it might be. But, as we often say, absence of evidence is not evidence of absence. The paper appears to overstate the findings, as does all the media reporting. A statement such as “Radical treatment resulted in a lower risk of disease progression than active monitoring but did not lower prostate cancer mortality” (page 10 of the article) amounts to a finding of no effect rather than a failure to find a significant effect. Null hypothesis significance testing strikes again.

Yeah, they should share the goddam data, which was collected using tons of taxpayer dollars.

Regarding the intent-to-treat thing: Yeah, this has come up before, and I’m not sure what to do; I just have the impression that our current standard approaches here have serious problems.

My short answer is that some modeling should be done. Yes, the resulting inferences will depend on the model, but that’s just the way things are; it’s the actual state of our knowledge. But that’s just cheap talk from me. I don’t have a model on offer here, I just think that’s the way to go: construct a probabilistic model for the joint distribution of all the variables (which treatment the patient chooses, along with the health outcome) conditional on patient characteristics, and go from there.
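To make that cheap talk slightly more concrete, here is a toy sketch of the kind of generative model I have in mind, with treatment choice and survival both depending on a patient characteristic. Every number in it is invented for illustration; a real analysis would fit both pieces jointly (for example, in Stan) rather than just simulate them:

```python
# Toy generative model: treatment choice and death both depend on a "severity" score.
# All parameter values are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
severity = rng.normal(0, 1, n)            # patient characteristic
assigned = rng.integers(0, 2, n)          # randomized arm: 0 = surveillance, 1 = treatment

# Sicker patients assigned to surveillance are more likely to switch to treatment, and vice versa.
p_treat = 1 / (1 + np.exp(-(2 * assigned - 1 + 0.8 * severity)))
received = rng.binomial(1, p_treat)

# Outcome depends on the treatment actually received and on severity.
true_effect = -0.3                        # made-up log-odds benefit of treatment
p_death = 1 / (1 + np.exp(-(-3 + 1.0 * severity + true_effect * received)))
death = rng.binomial(1, p_death)

# A naive as-treated comparison is confounded by severity; a joint model that
# conditions on severity (or uses assignment as an instrument) is what's needed.
print(death[received == 1].mean(), death[received == 0].mean())
```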

I agree with Lehman that the intent-to-treat analysis is not the main goal here. It’s fine to do that analysis but it’s not good to stop there, and it’s really not good to hide information that could be used to go further.

As Lehman puts it:

Intent-to-treat analysis makes sense from a public health point of view if it closely reflects the actual medical practice. But from a patient point of view of making a decision regarding treatment, the actual treatment is more meaningful than intent-to-treat. So, when the two estimates differ considerably, it seems to me that they should both be reported – or, at least, the data should be provided that would allow both analyses to be done.

Also, the topic is relevant to me cos all of a sudden I need to go to the bathroom all the time. My doctor says my PSA is ok so I shouldn’t worry about cancer, but it’s annoying!

I told this to Lehman, who responded:

Unfortunately, the study in question makes PSA testing even less worthwhile than previously thought. (I get mine checked regularly and that is my only current monitoring, but it is not looking like that is worth much. Or should I say there is no statistically significant (p > .05) evidence that it means anything?)

Damn.

Causal inference and the aggregation of micro effects into macro effects: The effects of wages on employment

James Traina writes:

I’m an economist at the SF Fed. I’m writing to ask for your first thoughts or suggested references on a particular problem that’s pervasive in my field: Aggregation of micro effects into macro effects.

This is an issue that has been studied since the 80s. For example, individual-level estimates of the effect of wages on employment using quasi-experimental tax variation are much smaller than aggregate-level estimates using time series variation. More recently, there has been an active debate on how to port individual-level estimates of the effect of government transfers on consumption to macro policy.

Given your expertise, I was wondering if you had insight into how you or other folks in the stats / causal inference field would approach this problem structure more generally.

My reply: Here’s a paper from 2006, Multilevel (hierarchical) modeling: What it can and cannot do. The short answer is that you can estimate micro and macro effects in the same model, but you don’t necessarily have causal identification at both levels. It depends on the design.
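For what it’s worth, here is a minimal simulated example (the variable names and numbers are invented) of estimating micro and macro coefficients in one multilevel model; whether either coefficient has a causal interpretation still depends on the design:

```python
# Minimal sketch: individual-level and group-level predictors in one multilevel model.
import numpy as np, pandas as pd, statsmodels.formula.api as smf

rng = np.random.default_rng(1)
J, n_per = 50, 40
group = np.repeat(np.arange(J), n_per)
x_macro = rng.normal(0, 1, J)             # e.g., a regional average wage (macro predictor)
x_micro = rng.normal(0, 1, J * n_per)     # e.g., an individual wage deviation (micro predictor)
group_eff = rng.normal(0, 0.5, J)         # unexplained group-level variation

y = 0.2 * x_micro + 0.6 * x_macro[group] + group_eff[group] + rng.normal(0, 1, J * n_per)

df = pd.DataFrame({"y": y, "x_micro": x_micro, "x_macro": x_macro[group], "group": group})
fit = smf.mixedlm("y ~ x_micro + x_macro", df, groups=df["group"]).fit()
print(fit.summary())   # recovers both the micro (0.2) and macro (0.6) coefficients
```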

You’ll also want a theoretical model. For example, in your model, if you want to talk about “the effects of wages,” it can help to consider potential interventions that could affect local wages. Such an intervention could be a minimum-wage law, or inflation that reduces real (not nominal) wages, or national economic conditions that make the labor market more or less competitive, etc. You can also think about potential interventions at an individual level, such as a person getting education or training, marrying or having a child, the person’s employer changing its policies, whatever.

I don’t know enough about your application to give more detail. The point is that “wages” is not in itself a treatment. Wages is a measured variable, and different wage-affecting treatments can have different effects on employment. You can think of these as instruments, even if you’re not actually doing an instrumental variables analysis. Also, treatments that affect individual wages will be different than treatments that affect aggregate wages, so it’s no surprise that they would have different effects on employment. There’s no strong theoretical reason to think that the effects would be the same.

Finally, I don’t understand how government transfers connect to wages in your problem. Government transfers do not directly affect wages, do they? So I feel like I’m missing some context here.


New research on social media during the 2020 election, and my predictions

Back in 2020, leading academics and researchers at the company now known as Meta put together a large project to study social media and the 2020 US elections — particularly the roles of Instagram and Facebook. As Sinan Aral and I had written about how many paths for understanding effects of social media in elections could require new interventions and/or platform cooperation, this seemed like an important development. Originally the idea was for this work to be published in 2021, but there have been some delays, including simply because some of the data collection was extended as what one might call “election-related events” continued beyond November and into 2021. As of 2pm Eastern today, the news embargo for this work has been lifted on the first group of research papers.

I had heard about this project back a long time ago and, frankly, had largely forgotten about it. But this past Saturday, I was participating in the SSRC Workshop on the Economics of Social Media and one session was dedicated to results-free presentations about this project, including the setup of the institutions involved and the design of the research. The organizers informally polled us with qualitative questions about some of the results. This intrigued me. I had recently reviewed an unrelated paper that included survey data from experts and laypeople about their expectations about the effects estimated in a field experiment, and I thought this data was helpful for contextualizing what “we” learned from that study.

So I thought it might be useful, at least for myself, to spend some time eliciting my own expectations about the quantities I understood would be reported in these papers. I’ve mainly kept up with the academic and  grey literature, I’d previously worked in the industry, and I’d reviewed some of this for my Senate testimony back in 2021. Along the way, I tried to articulate where my expectations and remaining uncertainty were coming from. I composed many of my thoughts on my phone Monday while taking the subway to and from the storage unit I was revisiting and then emptying in Brooklyn. I got a few comments from Solomon Messing and Tom Cunningham, and then uploaded my notes to OSF and posted a cheeky tweet.

Since then, starting yesterday, I’ve spoken with journalists and gotten to view the main text of papers for two of the randomized interventions for which I made predictions. These evaluated effects of (a) switching Facebook and Instagram users to a (reverse) chronological feed, (b) removing “reshares” from Facebook users’ feeds, and (c) downranking content by “like-minded” users, Pages, and Groups.

My guesses

My main expectations for those three interventions could be summed up as follows. These interventions, especially chronological ranking, would each reduce engagement with Facebook or Instagram. This makes sense if you think the status quo is somewhat-well optimized for showing engaging and relevant content. So some of the rest of the effects — on, e.g., polarization, news knowledge, and voter turnout — could be partially inferred from that decrease in use. This would point to reductions in news knowledge, issue polarization (or coherence/consistency), and small decreases in turnout, especially for chronological ranking. This is because people get some hard news and political commentary they wouldn’t have otherwise from social media. These reduced-engagement-driven effects should be weakest for the “soft” intervention of downranking some sources, since content predicted to be particularly relevant will still make it into users’ feeds.

Besides just reducing Facebook use (and everything that goes with that), I also expected swapping out feed ranking for reverse chron would expose users to more content from non-friends via, e.g., Groups, including large increases in untrustworthy content that would normally rank poorly. I expected some of the same would happen from removing reshares, which I expected would make up over 20% of views under the status quo, and so would be filled in by more Groups content. For downranking sources with the same estimated ideology, I expected this would reduce exposure to political content, as much of the non-same-ideology posts will be by sources with estimated ideology in the middle of the range, i.e. [0.4, 0.6], which are less likely to be posting politics and hard news. I’ll also note that much of my uncertainty about how chronological ranking would perform was because there were a lot of unknown but important “details” about implementation, such as exactly how much of the ranking system really gets turned off (e.g., how much likely spam/scam content still gets filtered out in an early stage?).

How’d I do?

Here’s a quick summary of my guesses and the results in these three papers:

Table of predictions about effects of feed interventions and the results

It looks like I was wrong in that the reductions in engagement were larger than I predicted: e.g., chronological ranking reduced time spent on Facebook by 21%, rather than the 8% I guessed, which was based on my background knowledge, a leaked report on a Facebook experiment, and this published experiment from Twitter.

Ex post, I hypothesize that this is because the duration of these experiments allowed for continual declines in use over months, with various feedback loops (e.g., users with a chronological feed log in less, so they post less, so they get fewer likes and comments, so they log in even less and post even less). As I dig into the 100s of pages of supplementary materials, I’ll be looking to understand what these declines looked like at earlier points in the experiment, such as by election day.

My estimates for the survey-based outcomes of primary interest, such as polarization, were mainly covered by the 95% confidence intervals, with the exception of two outcomes from the “no reshares” intervention.

One thing is that all these papers report weighted estimates for a broader population of US users (population average treatment effects, PATEs), which are less precise than the unweighted (sample average treatment effect, SATE) results. Here I focus mainly on the unweighted results, as I did not know there was going to be any weighting, and these are also the narrower, and thus riskier, CIs for me. (There seems to have been some mismatch between the outcomes listed in the talk I saw and what’s in the papers, so I didn’t make predictions for some reported primary outcomes, and some outcomes I made predictions for don’t seem to be reported, or I haven’t found them in the supplements yet.)
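As a toy illustration of the precision cost of weighting (nothing here uses the actual study data, and the skewed weights are just simulated), the weighted, PATE-style difference in means ends up with a noticeably larger standard error than the unweighted, SATE-style one:

```python
# Toy illustration: weighting to a broader population usually widens the interval.
import numpy as np

rng = np.random.default_rng(2)
n = 20000
treat = rng.integers(0, 2, n)
y = 0.02 * treat + rng.normal(0, 1, n)     # tiny true effect, in SD units (made up)
w = rng.lognormal(0, 1, n)                 # skewed survey-style weights (hypothetical)

def est_and_se(y, t, w):
    def wmean(v, wt): return np.sum(wt * v) / np.sum(wt)
    est = wmean(y[t == 1], w[t == 1]) - wmean(y[t == 0], w[t == 0])
    boots = []                             # bootstrap standard error
    for _ in range(500):
        idx = rng.integers(0, len(y), len(y))
        yb, tb, wb = y[idx], t[idx], w[idx]
        boots.append(wmean(yb[tb == 1], wb[tb == 1]) - wmean(yb[tb == 0], wb[tb == 0]))
    return round(est, 4), round(float(np.std(boots)), 4)

print("SATE (unweighted):", est_and_se(y, treat, np.ones(n)))
print("PATE (weighted):  ", est_and_se(y, treat, w))
```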

Now is a good time to note that I basically predicted what psychologists armed with Jacob Cohen’s rules of thumb might, extrapolating downward, call “minuscule” effect sizes. All my predictions for survey-based outcomes were 0.02 standard deviations or smaller. (Recall Cohen’s rules of thumb say 0.2 is small, 0.5 medium, and 0.8 large.)

Nearly all the results for these outcomes in these two papers were indistinguishable from the null (p > 0.05), with standard errors for survey outcomes at 0.01 SDs or more. This is consistent with my ex ante expectations that the experiments would face severe power problems, at least for the kind of effects I would expect. Perhaps by revealed preference, a number of other experts had different priors.

A rare p < 0.05 result is that chronological ranking reduced news knowledge by 0.035 SDs with 95% CI [-0.061, -0.008], which includes my guess of -0.02 SDs. Removing reshares may have reduced news knowledge even more than chronological ranking did — and by more than I guessed.

Even with so many null results, I was still sticking my neck out a bit compared with just guessing zero everywhere, since in some cases if I had put the opposite sign my estimate wouldn’t have been in the 95% CI. For example, downranking “like-minded” sources produced a CI of [-0.031, 0.013] SDs, which includes my guess of -0.02, but not its negation. On the other hand, I got some of these wrong: I guessed removing reshares would reduce affective polarization, but a -0.02 SD effect is outside the resulting [-0.005, +0.030] interval.

It was actually quite a bit of work to compare my predictions to the results because I didn’t really know a lot of key details about exact analyses and reporting choices, which strikingly even differ a bit across these three papers. So I might yet find more places where I can, with a lot of reading and a bit of arithmetic, figure out where else I may have been wrong. (Feel free to point these out.)

Further reflections

I hope that this helps to contextualize the present results with expert consensus — or at least my idiosyncratic expectations. I’ll likely write a bit more about these new papers and further work released as part of this project.

It was probably an oversight for me not to make any predictions about the observational paper looking at polarization in exposure and consumption of news media. I felt like I had a better handle on thinking about simple treatment effects than these measures, but perhaps that was all the more reason to make predictions. Furthermore, given the limited precision of the experiments’ estimates, perhaps it would have been more informative (and riskier) to make point predictions about these precisely estimated observational quantities.

[This post is by Dean Eckles. I want to note that I was an employee or contractor of Facebook (now Meta) from 2010 through 2017. I have received funding for other research from Meta, Meta has sponsored a conference I organize, and I have coauthored with Meta employees as recently as earlier this month. I was also recently a consultant to Twitter, ending shortly after the Musk acquisition. You can find all my disclosures here.]

Do Ultra-Processed Data Cause Excess Publication and Publicity Gain?

Ethan Ludwin-Peery writes:

I was reading this paper today, Ultra-Processed Diets Cause Excess Calorie Intake and Weight Gain (here, PDF attached), and the numbers they reported immediately struck me as very suspicious.

I went over it with a collaborator, and we noticed a number of things that we found concerning. In the weight gain group, people gained 0.9 ± 0.3 kg (p = 0.009), and in the weight loss group, people lost 0.9 ± 0.3 kg (p = 0.007). These numbers are identical, which is especially suspicious since the sample size is only 20, which is small enough that we should really expect more noise. What are the chances that there would be identical average weight loss in the two conditions and identical variance? We also think that 0.3 kg is a suspiciously low standard error for weight fluctuation.

They also report that weight changes were highly correlated with energy intake (r = 0.8, p < 0.0001). This correlation coefficient seems suspiciously high to us. For comparison, the BMI of identical twins is correlated at about r = 0.8, and about r = 0.9 for height.

Their data is publicly available here, so we took a look and found more to be concerned about. They report participant weight to two decimal places in kilograms for every participant on every day. Kilograms to two decimal places should be pretty sensitive (an ounce of water is about 0.02 kg), but we noticed that there were many cases where the exact same weight appeared for a participant two or even three times in a row. For example, participant 21 was listed as having a weight of exactly 59.32 kg on days 12, 13, and 14; participant 13 was listed as having a weight of exactly 96.43 kg on days 10, 11, and 12; and participant 6 was listed as having a weight of exactly 49.54 kg on days 23, 24, and 25.

In fact this last case is particularly egregious, as 49.54 kg is exactly one kilogram less, to two decimal places, than the baseline for this participant’s weight when they started, 50.54 kg. Participant 6 only ever seems to lose or gain weight in increments of 0.10 kilograms. Similar patterns can also be seen in the data of other participants.

We haven’t looked any deeper yet because we think this is already cause for serious concern. It looks a lot like heavily altered or even fabricated data, and we suspect that as we look closer, we will find more red flags. Normally we wouldn’t bother but given that this is from the NIH, it seemed like it was worth looking into.

What do you think? Does this look equally suspicious to you?

He and his sister Sarah followed up with a post, also there are posts by Nick Brown (“Some apparent problems in a high-profile study of ultra-processed vs unprocessed diets”) and Ivan Oransky (“NIH researcher responds as sleuths scrutinize high-profile study of ultra-processed foods and weight gain”).

I don’t really have anything to add on this one. Statistics is hard, data analysis is hard, and when research is done on an important topic, it’s good to have outsiders look at it carefully. So good all around, whatever happens with this particular story.
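For readers who want to poke at the posted data themselves, the repeated-value check that Ludwin-Peery describes takes only a few lines. This is a minimal sketch; the file and column names below are hypothetical and would need to be matched to the actual download:

```python
# Flag participants whose recorded weight repeats exactly on consecutive days.
import pandas as pd

df = pd.read_csv("ultra_processed_weights.csv")   # hypothetical file name
df = df.sort_values(["participant", "day"])

# Start a new run whenever the weight changes within a participant (or the participant changes).
new_run = (df["weight_kg"].ne(df.groupby("participant")["weight_kg"].shift())
           | df["participant"].ne(df["participant"].shift()))
df["run_id"] = new_run.cumsum()

runs = df.groupby("run_id").agg(participant=("participant", "first"),
                                weight=("weight_kg", "first"),
                                days=("day", "count"))
print(runs[runs["days"] >= 3])   # e.g., three identical daily weights in a row
```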

Here are some ways of making your study replicable. (No, the first steps are not preregistration or increasing the sample size!)

John Protzko, Jon Krosnick, Leif Nelson, Brian Nosek, Jordan Axt, Matthew Berent, Nick Buttrick, Matthew DeBell, Charles Ebersole, Sebastian Lundmark, Bo MacInnis, Michael O’Donnell, Hannah Perfecto, James Pustejovsky, Scott Roeder, Jan Walleczek, and Jonathan Schooler write:

Failures to replicate evidence of new discoveries have forced scientists to ask whether this unreliability is due to suboptimal implementation of methods or whether presumptively optimal methods are not, in fact, optimal. This paper reports an investigation by four coordinated laboratories of the prospective replicability of 16 novel experimental findings using rigor-enhancing practices: confirmatory tests, large sample sizes, preregistration, and methodological transparency. . . . [italics added]

Don’t get me wrong, I’ve got no problem with large sample sizes, preregistration, and methodological transparency. And confirmatory tests can be fine too, as long as they’re not misinterpreted and not used for decision making.

My biggest concern with the italicized bit is that the authors or readers of this article will come away thinking that these are the best rigor-enhancing practices in science, or the first rigor-enhancing practices that researchers should reach for, or the most important rigor-enhancing practices, or anything like that.

Replicability is great, and for the present discussion (although not in general) I’m ok with roughly equating “rigor” with “replicability,” so that “rigor-enhancing practices” would be those that make a study more likely to be replicable.

What, then, are the first steps that I would recommend to make a study more likely to be replicable? Here’s my quick list, in approximately decreasing order of importance:

1. Make it clear what you’re actually doing. Describe manipulations, exposures, and measurements fully and clearly. This is related to the “methodological transparency” of Protzko et al., but here I’m not talking about statistical methods, I’m talking about scientific methods. What exactly did you do in the lab or the field, where did you get your participants, where and when did you work with them, etc.?

2. Increase your effect size, e.g., do a more effective treatment. This might sound kind of obvious, but consider all the studies where, from theoretical grounds, one would expect very small effects, for example studies of subliminal messages (or notoriously, here).

3. Focus your study on the people and scenarios where effects are likely to be largest. For example, in an education experiment, it could be that the best-prepared students don’t need the intervention, and the students who are the least well prepared ahead of time can’t get much benefit from it; thus, to maximize the average treatment effect in your study, you should perform it on students in the middle of the range. Similarly, try to set up the conditions of your experiment so that the treatment effect will be as large as possible.

4. Improve your outcome measurement: a more focused and less variable outcome measure should result in a lower standard error of estimation.

5. Improve pre-treatment measurements; adjusting for these in your analysis should reduce your uncertainty in estimating average effects. Often a good way to improve outcome measurements and pre-treatment measurements is to take more measurements on each person, either asking more questions at the time of the study or doing more followups.

6. The methods listed in the above-linked article: “confirmatory tests, large sample sizes, preregistration, and methodological transparency.” Large sample size is ok, but in general I think it makes sense to get more data and learn more from each participant, rather than just dragging more people into a bare-bones study. As for the others: Sure, making preregistered predictions is fine, why not, but more as a way of uncovering potential problems with your study than by directly making it better. I don’t mind the four steps listed in the linked article, but I really don’t like these as the first four characteristics—indeed the only four characteristics—of rigor in the social and behavioral sciences.

In short, there are ways of increasing statistical power, other than brute-force increasing your sample size.
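To make points 4 and 5 above concrete, here is a tiny simulation (with invented numbers) showing how much a good pre-treatment measurement can shrink the standard error of the estimated treatment effect without adding a single participant:

```python
# Adjusting for a pre-treatment measurement reduces the SE of the treatment effect.
import numpy as np, statsmodels.api as sm

rng = np.random.default_rng(3)
n, effect, rho = 200, 0.2, 0.7
pre = rng.normal(0, 1, n)                           # pre-treatment measurement
treat = rng.integers(0, 2, n)
post = effect * treat + rho * pre + rng.normal(0, np.sqrt(1 - rho**2), n)

unadj = sm.OLS(post, sm.add_constant(treat)).fit()
adj = sm.OLS(post, sm.add_constant(np.column_stack([treat, pre]))).fit()
print("SE without adjustment:", unadj.bse[1])       # roughly sqrt(4/n)
print("SE with adjustment:   ", adj.bse[1])         # smaller by roughly sqrt(1 - rho^2)
```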

These are general questions that come up in many studies; see here for a quick overview.

P.S. I’m a big fan of Brian Nosek and the whole replication movement. I wonder whether one of the problems with the above-linked article is that it’s a short piece with a zillion authors. The result can be a diffusion of responsibility where nobody gets around to checking to see if everything makes complete sense.

Social experiments and the often-neglected role of theory

Jason Collins discusses a paper by Milkman et al. that presented “a megastudy testing 54 interventions to increase the gym visits of 61,000 experimental participants.” Some colleagues and I discussed that paper awhile ago—I think we were planning to write something up about it but I don’t remember what happened with that.

As I recall, the study had two main results. First, researchers had overestimated the effects of various interventions: basically, people thought they had great ideas for increasing fitness participation, but the real world is complicated, and most things don’t work as effectively as you might expect. Second, regarding the interventions themselves, the evidence was mixed: even in a large study it can be hard to detect average effects.

The overestimation of effect sizes is consistent with things we’ve seen before in other areas of policy research. Past literature tends to report inflated effect sizes: the statistical significance filter, both within and between studies, leads to a selection bias which is always a concern but particularly so when improvements are incremental (there is no magic bullet that will get people to the gym). Beyond this, the effects we typically envision are the effects when the treatment is effective. When considering the average treatment effect, we’re also averaging over all those people for whom the effect is near zero, as illustrated in Figure 1d of this paper.
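A few lines of simulation show how strong this filter can be; the numbers are invented, chosen so that the true effect is about one standard error:

```python
# The statistical significance filter: estimates that clear p < 0.05 are inflated on average.
import numpy as np

rng = np.random.default_rng(4)
true_effect, se, n_studies = 0.05, 0.05, 100_000
estimates = rng.normal(true_effect, se, n_studies)
significant = np.abs(estimates) > 1.96 * se

print("mean of all estimates:        ", estimates.mean())               # ~0.05
print("mean of 'significant' results:", estimates[significant].mean())  # more than double the true effect
```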

The big problem: Where do the interventions come from?

Collins’s discussion seems reasonable to me. In particular, I agree with what he identifies as the big problem with the design of this “mega-study,” which is that there’s all sorts of rigor in the randomization and analysis plan, but no rigor at all when it comes to deciding what interventions to test.

Unfortunately, this is standard practice in policy analysis! Indeed, if you look at a statistics book, including mine, you’ll see lots and lots on causal inference and estimation, but nothing on how to come up with the interventions to study in the first place.

Here’s how Collins puts it:

At first glance, the list of 54 interventions suggests the megastudy has an underlying philosophy of “throw enough things at a wall and surely something will stick”. . . .

Fair enough. But this concession implicitly means the authors have given up on developing an understanding of human decision making that might allow us to make predictions. Each hypothesis or set of hypotheses they tested concern discrete empirical regularities. They are not derived from or designed to test a core model of human decision making. We have behavioural scientists working as technicians, seeking to optimise a particular objective with the tools at hand. . . .

A big problem here is that “tools at hand” are not always so good, especially when these tools are themselves selected based on past noisy and biased evaluations. What are those 54 interventions, anyway? Just some things that a bunch of well-connected economists wanted to try out. Well-connected economists know lots of things, but maybe not so much about motivating people to go to the gym.

A related problem is variation: These treatments, even when effective, are not simply push-button-X-and-then-you-get-outcome-Y. Effects will be zero for most people and will be highly variable among the people for whom effects are nonzero. The result is that the average treatment effect will be much smaller than you expect. This is not just a problem of “statistical power”; it’s also a conceptual problem with this whole “reduced-form” way of looking at the world. To put it another way: Lack of good theory has practical consequences.

Principal stratification for vaccine efficacy (causal inference)

Rob Trangucci, Yang Chen, and Jon Zelner write:

In order to meet regulatory approval, pharmaceutical companies often must demonstrate that new vaccines reduce the total risk of a post-infection outcome like transmission, symptomatic disease, severe illness, or death in randomized, placebo-controlled trials. Given that infection is a necessary precondition for a post-infection outcome, one can use principal stratification to partition the total causal effect of vaccination into two causal effects: vaccine efficacy against infection, and the principal effect of vaccine efficacy against a post-infection outcome in the patients that would be infected under both placebo and vaccination. Despite the importance of such principal effects to policymakers, these estimands are generally unidentifiable, even under strong assumptions that are rarely satisfied in real-world trials. We develop a novel method to nonparametrically point identify these principal effects while eliminating the monotonicity assumption and allowing for measurement error. Furthermore, our results allow for multiple treatments, and are general enough to be applicable outside of vaccine efficacy. Our method relies on the fact that many vaccine trials are run at geographically disparate health centers, and measure biologically-relevant categorical pretreatment covariates. We show that our method can be applied to a variety of clinical trial settings where vaccine efficacy against infection and a post-infection outcome can be jointly inferred. This can yield new insights from existing vaccine efficacy trial data and will aid researchers in designing new multi-arm clinical trials.

Sounds important. And they use Stan, which always makes me happy.

Colonoscopy corner: Misleading reporting of intent-to-treat analysis

Dale Lehman writes:

I’m probably the 100th person that has sent this to you: here is the NEJM editorial and here is the study.

The underlying issue, which has been a concern of mine for some time now, is the usual practice of basing analysis on “intention to treat” rather than on “treatment per protocol.” In the present case, randomized assignment into a group “invited” to have a colonoscopy and a group not invited resulted in a low percentage actually following the advice. Based on intention to treat, the benefits of colonoscopy appear to be small (or potentially none). Based on those actually receiving the colonoscopy, the effectiveness appears quite large. While the editorial accurately describes the results, it seems far less clear than it could/should be. The other reasons why this latest study may differ from prior ones are valid (effectiveness of the physicians, long-term follow up, etc.) but pale in importance next to the obvious conclusion that when adherence is low, effectiveness is thwarted. As the editorial states, “screening can be effective only if it is performed.” I think that should be the headline and that is what the media reports should have focused on. Instead, the message is mixed at best – leading some headlines to suggest that the new study raises questions about whether or not colonoscopies are effective (or cost-effective).

The correct story does come through if you read all the stories, but I think the message is far more ambiguous than it should be. Intention to treat is supposed to reflect real-world practice, whereas treatment per protocol is more of a best-case analysis. But when the adherence rate is so low (less than 50% here) and the two analyses differ so much, then the most glaring result of this study should be that increasing adherence is of primary importance (in my opinion). Instead, there is a mixed message. I don’t even think the difference can be ascribed to the difference in audiences. Intention to treat may be appropriate for public health practitioners, whereas treatment per protocol might be viewed as appropriate for individual patients. However, in this case it would seem relatively costless to invite everyone in the target group to have a colonoscopy, even if less than half will do so. Actually, I think the results indicate that much more should be done to improve adherence, but at a minimum I see little justification for not inviting everyone in the target group to get a colonoscopy. I don’t see how this study casts much doubt on those conclusions, yet the NEJM and the media seem intent on mixing the message.

In fact, Dale was not the 100th person who had sent this to me or even the 10th person. He was the only one, and I had not heard about this story. I’m actually not sure how I would’ve heard about it . . .

Anyway, I quickly looked at everything and I agree completely with Dale’s point. For example, the editorial says:

In the intention-to-screen analysis, *colonoscopy* was found to reduce the risk of colorectal cancer over a period of 10 years by 18% (risk ratio, 0.82; 95% confidence interval [CI], 0.70 to 0.93). However, the reduction in the risk of death from colorectal cancer was not significant (risk ratio, 0.90; 95% CI, 0.64 to 1.16).

I added the emphasis above. What it should say there is not “colonoscopy” but “encouragement to colonoscopy.” Just two words, but a big deal. There’s nothing wrong with an intent-to-treat analysis, but then let’s be clear: it’s measuring the intent to treat, not the treatment itself.

P.S. Relatedly, I received this email from Gerald Weinstein:

Abuse of the Intention to Treat Principle in RCTs has led to some serious errors in interpreting such studies. The most absurd, and possibly deadly, example is a recent colonoscopy study which was widely reported as “Screening Procedure Fails to Prevent Colon Cancer Deaths in a Gold-standard Study,” despite the fact that only 42% of the colonoscopy group actually underwent the procedure. My concern is that far too many people will interpret this study as meaning “colonoscopy doesn’t work.”

It seems some things don’t change, as I had addressed this issue in a paper written with your colleague, Bruce Levin, in 1985 (Weinstein GS and Levin B: The coronary artery surgery study (CASS): a critical appraisal. J. Thorac. Cardiovasc. Surg. 1985;90:541-548). I am a retired cardiac surgeon who has had to deal with similar misguided studies during my long career.

The recent NEJM article “Effect of Colonoscopy Screening on Risks of Colorectal Cancer and Related Death” showed only an 18% reduction in death in the colonoscopy group which was not statistically significant and was widely publicized in the popular media with headlines such as “Screening Procedure Fails to Prevent Colon Cancer Deaths in Large Study.”

In fact, the majority of people in the study group did not undergo colonoscopy, but were only *invited* to do so, with only 42% participating. How can colonoscopy possibly prevent cancer in those who don’t undergo it? Publishing such a study is deeply misguided and may discourage colonoscopy, with tragic results.

Consider this: If someone wanted to study attending a wedding as a superspreader event, but included in the denominator all those who were invited, rather than those who attended, the results would be rendered meaningless by so diluting the case incidence as to lead to the wrong conclusion.

My purpose here is not merely to bash this study, but to point out difficulties with the “Intention to Treat” principle, which has long been a problem with randomized controlled studies (RCTs). The usefulness of RCTs lies in the logic of comparing two groups, alike in every way *except* for the treatment under study, so any differences in outcome may be imputed to the treatment. Any violation of this design can invalidate the study, but too often, such studies are assumed to be valid because they have the structure of an RCT.

There are several ways a clinical study can depart from RCT design: patients in the treatment group may not actually undergo the treatment (as in the colonoscopy study) or patients in the control group may cross over into the treatment group, yet still be counted as controls, as happened in the Coronary Artery Surgery Study (CASS) of the 1980s. Some investigators refuse to accept the problematic effects of such crossover and insist they are studying a “policy” of treatment, rather than the treatment itself. This concept, followed to its logical (illogical?) conclusion, leads to highly misleading trials, like the colonoscopy study.

P.P.S. I had a colonoscopy a couple years ago and it was no big deal, not much of an inconvenience at all.

Physics educators do great work with innovative teaching. They should do better in evaluating evidence of effectiveness and be more open to criticism.

Michael Weissman pointed me to a frustrating exchange he had with the editor of Physical Review Physics Education Research. Weissman submitted an article criticizing an article that the journal had published, and the editor refused to publish his article. That’s fine—it’s the journal’s decision to decide what to publish!—but I agree with Weissman that some of the reasons they gave for not publishing were bad reasons, for example, “in your abstract, you describe the methods used by the researchers as ‘incorrect’ which seems inaccurate, SEM or imputation are not ‘incorrect’ but can be applied, each time they are applied, it involves choices (which are often imperfect). But making these choices explicit, consistent, and coherent in the application of the methods is important and valuable. However, it is not charitable to characterize the work as incorrect. Challenges are important, but PER has been and continues to be a place where people tend to see the positive in others.”

I would not have the patience to go even 5 minutes into these models with the coefficients and arrows, as I think they’re close to hopeless even in the best of settings and beyond hopeless for observational data, nor do I want to think too hard about terms such as “two-way correlation,” a phrase which I hope never to see again!

I agree with Weissman on these points:

1. It is good for journals to publish critiques, and I don’t think that critiques should be held to higher standards than the publications they are critiquing.

2. I think that journals are too focused on “novel contributions” and not enough on learning from mistakes.

3. Being charitable toward others is fine, all else equal, but not so fine if this is used as a reason for researchers, or an entire field, to avoid confronting the mistakes they have made or the mistakes they have endorsed. Here’s something I wrote in praise of negativity.

4. Often these disputes are presented as if the most important parties are the authors of the original paper, the journal editor, and the author of the letter or correction note. But that’s too narrow a perspective. The most important parties are not involved in the discussion at all: these are the readers of the articles—those who will take its claims and apply them to policy or to further research—and all the future students who may be affected by these policies. Often it seems that the goal is to minimize any negative career impact on the authors of the original paper and to minimize any inconvenience to the journal editors. I think that’s the wrong utility function, and to ignore the future impacts of uncorrected mistakes is implicitly an insult to the entire field. If the journal editors think the work they publish has value—not just in providing chits that help scholars get promotions and publicity, but in the world outside the authors of these articles—then correcting errors and learning from mistakes should be a central part of their mission.

I hope Weissman’s efforts in this area have some effect in the physics education community.

As a statistics educator, I’ve been very impressed by the innovation shown by physics educators (for example, the ideas of peer instruction and just-in-time teaching, which I use in my classes), so I hope they can do better in this dimension of evaluating evidence of effectiveness.

Knee meniscus repair: Is it useful? Potential biases in intent-to-treat studies.

Paul Kedrosky writes:

Not sure if you’ve written on this, but the orthopedic field is tying itself in knots trying to decide if meniscus repair is useful. The field thought it was, and then decided it wasn’t after some studies a decade ago, and is now mounting a rear-guard action via criticisms of the statistics of intent-to-treat studies.

He points to this article in the journal Arthroscopy, “Can We Trust Knee Meniscus Studies? One-Way Crossover Confounds Intent-to-Treat Statistical Methods,” by James Lubowitz, Ralph D’Agostino Jr., Matthew Provencher, Michael Rossi, and Jefferson Brand.

Hey, I know Ralph from grad school! So I’m inclined to trust this article. But I don’t know anything about the topic. Here’s how the article begins:

Randomized controlled studies have a high level of evidence. However, some patients are not treated in the manner to which they were randomized and actually switch to the alternative treatment (crossover). In such cases, “intent-to-treat” statistical methods require that such a switch be ignored, resulting in bias. Thus, the study conclusions could be wrong. This bias is a common problem in the knee meniscus literature. . . . patients who fail nonsurgical management can cross over and have surgery, but once a patient has surgery, they cannot go back in time and undergo nonoperative management. . . . the typical patient selecting to cross over is a patient who has more severe symptoms, resulting in failure of nonoperative treatment. Patients selecting to cross over are clearly different from the typical patient who does not cross over, because the typical patient who does not cross over is a patient who has less severe symptoms, resulting in good results of nonoperative treatment. Comparing patients with more severe symptoms to patients with less severe symptoms is biased.

Interesting.
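To see the one-way crossover problem concretely, here is a toy simulation (all numbers invented) in which sicker patients fail nonoperative care and cross over to surgery, so that a naive as-treated comparison makes surgery look much worse than it is, while the intention-to-treat comparison is merely diluted:

```python
# One-way crossover: severity drives both crossover to surgery and worse outcomes.
import numpy as np

rng = np.random.default_rng(5)
n = 10000
severity = rng.normal(0, 1, n)
assigned_surgery = rng.integers(0, 2, n)

# Non-surgical patients with high severity cross over to surgery (one way only).
crossover = (assigned_surgery == 0) & (severity > 1.0)
got_surgery = assigned_surgery.astype(bool) | crossover

true_benefit = 0.5                                   # made-up improvement from surgery
outcome = -severity + true_benefit * got_surgery + rng.normal(0, 1, n)

itt = outcome[assigned_surgery == 1].mean() - outcome[assigned_surgery == 0].mean()
as_treated = outcome[got_surgery].mean() - outcome[~got_surgery].mean()
print("intention-to-treat estimate:", itt)           # diluted toward zero
print("naive as-treated estimate:  ", as_treated)    # biased down, because the sicker patients chose surgery
```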

That article is from 2016. I wonder what’s been learned since then? What’s the consensus now? Googling *knee meniscus surgery* yields this from the Cleveland Clinic:

Meniscus surgery is a common operation to remove or repair a torn meniscus, a piece of cartilage in the knee. The surgery requires a few small incisions and takes about an hour. Recovery and rehabilitation take a few weeks. The procedure can reduce pain, improve mobility and stability, and get you back to life’s activities. . . .

What follows is lots of discussions about the procedure, who should get it, and its risks and benefits. Nothing about any studies claiming that it doesn’t work. So in this case maybe the “rear-guard action” was correct, or at least successful so far.

In any case, this is a great example for thinking about potential biases in intent-to-treat studies.

Estimated effects of pre-K over the decades: From Cold War optimism to modern-day pessimism

Ethan Steinberg writes:

A while back, you briefly blogged a bit about a very nicely done pre-K RCT on results up to the third grade:

I thought you might be interested that the authors have just published their sixth grade results.

It looks like the negative effects found in the third grade analysis have gotten stronger:

Data through sixth grade from state education records showed that the children randomly assigned to attend pre-K had lower state achievement test scores in third through sixth grades than control children, with the strongest negative effects in sixth grade

I think the really interesting question about this study is: if the study is correct, how were we so wrong previously? Someone posted the above plot from a 2013 article that really shows how much the earlier pre-K studies differed from the later ones.

Is this a change in types of pre-K studies being done, a change in the environment (maybe something about the internet really changed the effectiveness of pre-K?), or a publication bias issue?

I don’t know! It’s my impression that those old studies were so noisy as to be essentially useless for any quantitative purposes. It’s funny how that could happen. I’m guessing that all these designs included power analyses but with massively overoptimistic hypothesized effect sizes, which can happen if you don’t fully think through the implications of treatment effect heterogeneity. Kinda scary to think of all this money, effort, and statistical analysis that was missing this basic point. To really understand this, we have to go back to the gung-ho Cold War mindset of the 1950s and 60s, the attitude that, with sufficient fortitude, all problems could be solved.
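A quick power calculation shows how this plays out. The effect sizes below are invented, just to illustrate the gap between an optimistic design assumption and a realistic effect:

```python
# Power analysis with an overoptimistic assumed effect size (effects in SD units).
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower()
n_planned = power.solve_power(effect_size=0.5, alpha=0.05, power=0.8)          # optimistic design assumption
actual_power = power.solve_power(effect_size=0.1, nobs1=n_planned, alpha=0.05)  # more realistic effect
print("n per group planned assuming d = 0.5:", round(n_planned))               # ~64
print("power of that n if the true d = 0.1: ", round(actual_power, 2))         # ~0.08
```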

Postdoctoral position at MIT: privacy, synthetic data, fairness & causal inference

I have appreciated Jessica’s recent coverage of differential privacy and related topics on this blog — especially as I’ve also started working in this general area.

So I thought I’d share this new postdoc position that Manish Raghavan and I have here at MIT, where these topics are an important focus. Here’s some of the description of the broad project area, which this researcher would help shape:

This research program is working to understand and advance techniques for sharing and using data while limiting what is revealed about any individual or organization. We are particularly interested in how privacy-preserving technologies interface with recent developments in high-dimensional statistical machine learning (including foundation models), questions about fairness of downstream decisions, and with causal inference. Applications include some in government and public policy (e.g., related to US Census Bureau data products) and increasing use in multiple industries (e.g., tech companies, finance).

While many people with relevant expertise might be coming from CS, we’re also very happy to get interest from statisticians — who have a lot to add here!

This post is by Dean Eckles.

How large is the underlying coefficient? An application of the Edlin factor to that claim that “Cash Aid to Poor Mothers Increases Brain Activity in Babies”

Often these posts start with a question that someone sends to me and continue with my reply. This time the q-and-a goes the other way . . .

I pointed Erik van Zwet to this post, “I’m skeptical of that claim that ‘Cash Aid to Poor Mothers Increases Brain Activity in Babies,’” and wrote:

This example (in particular, the regression analysis at the end of the PPS section) makes me think about your idea of a standard-error-scaled prior, this time for regression coefficients. What do you think?

Erik replied:

Yes, I did propose a default prior for regression coefficients:

Wow, 2019 seems so long ago now! This was before I had the nice Cochrane data, and started focusing on clinical trials. The paper was based on a few hundred z-values of regression coefficients which I collected by hand from Medline. I tried to do that in an honest way as follows:

It is a fairly common practice in the life sciences to build multivariate regression models in two steps. First, the researchers run a number of univariate regressions for all predictors that they believe could have an important effect. Next, those predictors with a p-value below some threshold are selected for the multivariate model. While this approach is statistically unsound, we believe that the univariate regressions should be largely unaffected by selection on significance, simply because that selection is still to be done!

Anyway, using a standard-error-scaled prior really means putting in prior information about the signal-to-noise ratio. The study with the brain activity in babies seems to have a modest sample size relative to the rather noisy outcome. So I would expect regression coefficients with z-values between 1 and 3 to be inflated, and an Edlin factor of 1/2 seems to be in about the right ballpark.

I think that type M errors are a big problem, but I also believe that the probability of a type S error tends to be quite small. So, if I see a more or less significant effect in a reasonable study, I would expect the direction of the effect to be correct.

I just want to add one thing here, which is that in that example, the place where I wanted to apply the Edlin factor was the control variables in the regression, where I was adjusting for pre-treatment predictors. The main effects in this example show no evidence of being different from what could be expected from pure chance.

This discussion is interesting in revealing two different roles of shrinkage. One role is what Erik is focusing on, which is shrinkage of effects of interest, which as he notes should generally have the effect of making the magnitudes of estimated effects smaller without changing their sign. The other role is shrinkage of coefficients of control variables, which regularizes these adjustments, which indirectly give more reasonable estimates of the effects of interest.
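To make the Edlin-factor idea concrete, here is the simplest conjugate-normal version of a standard-error-scaled prior (a sketch, not necessarily Erik’s exact proposal, and the numbers are invented): with a normal prior centered at zero whose scale equals the standard error, the posterior mean is exactly half the raw estimate.

```python
# Conjugate-normal sketch of a standard-error-scaled prior (Edlin factor of 1/2).
import numpy as np

beta_hat, se = 0.8, 0.35            # hypothetical noisy coefficient (z a bit over 2)
prior_sd = se                       # the "standard-error-scaled" choice, prior mean 0

post_var = 1 / (1 / se**2 + 1 / prior_sd**2)
post_mean = post_var * (beta_hat / se**2)
print("raw:     ", beta_hat, "+/-", se)
print("shrunken:", post_mean, "+/-", np.sqrt(post_var))   # 0.4 +/- 0.25
```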

Climate change makes the air hotter, thus less dense, leading to more home runs.

Roxana Dior writes:

From this news article, I found out about this paper, “Global warming, home runs, and the future of America’s pastime,” by Christopher Callahan, Nathaniel Dominy, Jeremy DeSilva, and Justin Mankin, which suggests home runs in baseball have become more numerous in recent years due to climate change, and will be scored more frequently in the future as temperatures rise.

Apart from the obvious question—when will Moneyball-obsessed general managers look at optimizing stadium air density when their team is at bat?—is this a statistically sound approach? I am no baseball aficionado, but air density changes due to temperature seem like they would have a minuscule effect on home runs scored, as I assume that the limiting factor in scoring one is the “cleanness of contact” with the bat, and that most batters hit the ball with sufficient power to clear the boundary when they do. There are probably a hundred other confounding variables to consider, such as PED usage etc., but the authors seem confident in their approach.

They end with:

More broadly, our findings are emblematic of the widespread influence anthropogenic global warming has already had on all aspects of life. Warming will continue to burden the poorest and most vulnerable among us, altering the risks of wildfires, heat waves, droughts, and tropical cyclones (IPCC, 2022). Our results point to the reality that even the elite billion-dollar sports industry is vulnerable to unexpected impacts.

I think I agree with the sentiment, but this feels like a bit of a reach, no?

From the abstract to the paper, which recently appeared in the Bulletin of the American Meteorological Society:

Home runs in baseball—fair balls hit out of the field of play—have risen since 1980, driving strategic shifts in gameplay. Myriad factors likely account for these trends, with some speculating that global warming has contributed via a reduction in ballpark air density. Here we use observations from 100,000 Major League Baseball games and 220,000 individual batted balls to show that higher temperatures substantially increase home runs. We isolate human-caused warming with climate models, finding that >500 home runs since 2010 are attributable to historical warming. . . .

My first thought on all this is . . . I’m not sure! As Dior writes, a change of 1 degree won’t do much—it should lower the air density by only about 1 part in 300, which isn’t much. The article claims that a 1 degree C rise in temperature is associated with a 2% rise in the number of home runs. On the other hand, it doesn’t take much to turn a long fly ball into a homer, so maybe a 1/300 decrease in air density is enough to do it.

OK, let’s think about this one. The ball travels a lot farther in Denver, where the air is thinner. A quick Google tells us that the air pressure in Denver is 15% lower than at sea level.

So, if it’s just air pressure, the effect of 1 degree heating would be about 1/50 of the effect of going from sea level to Denver. And what would that be? A quick Google turns up this page by physicist Alan Nathan from 2007, which informs us that:

There is a net force on the ball that is exactly opposite to its direction of motion. This force is called the drag force, although it is also commonly referred to as “air resistance”. The drag plays an extremely important role in the flight of a fly ball. For example, a fly ball that carries 400 ft would carry about 700 ft if there were no drag. The drag plays a less significant — but still important — role in the flight of a pitched baseball. Roughly speaking, a baseball loses about 10% of its speed during the flight between pitcher and catcher, so that a baseball that leaves the pitcher’s hand at 95 mph will cross the plate at about 86 mph. If the baseball is also spinning, it experiences the Magnus force, which is responsible for the curve or “break” of the baseball. . . .

Both the drag and Magnus forces . . . are proportional to the density of the air. . . . the air density in Denver (5280 ft) is about 82% of that at sea level. . . . the drag and Magnus forces in Coors will be about 82% of their values at Fenway.

What about the effect of altitude? Here’s Nathan again:

The reduced drag and Magnus forces at Coors will have opposite effects on fly balls on a typical home run trajectory. The principal effect is the reduced drag, which results in longer fly balls. A secondary effect is the reduced Magnus force. Remember that the upward Magnus force on a ball hit with backspin keeps it in the air longer so that it travels farther. Reducing the Magnus force therefore reduces the distance. However, when all is said and done, the reduced drag wins out over the reduced Magnus force, so that fly balls typically travel about 5% farther at Coors than at Fenway, all other things equal. . . . Therefore a 380 ft drive at Fenway will travel nearly 400 ft at Coors. . . .

Also, Nathan says that when the ball is hotter and the air is drier, the ball is bouncier and comes off the bat faster.

The next question is how this will affect the home run total. Ignoring the bouncy-ball thing, we’d want to know how many fly balls are close enough to being home runs that an extra 20 feet would take them over the fence.

I’m guessing the answer to this question is . . . a lot! As a baseball fan, I’ve seen lots of deep fly balls.

And, indeed, at this linked post, Nathan reports the results of an analysis of fly balls and concludes:

For each 1 ft reduction in the fly-ball distance, the home-run probability is reduced by 2.3 percent.

So making the air thinner so that the ball goes 20 feet farther should increase the home run rate by about 46%. Or, to go back to the global-warming thing, 1/50th of this effect should increase the home run rate by about 1%. This is not quite the 2% claimed in the recent paper that got all this publicity, but (a) 2% isn’t far from 1%, and given that the 1% comes from a simple physics-based calculation, 2% is not an unreasonable or ridiculous empirical claim; (b) the 1% just came from the reduced air pressure, not accounting for a faster speed off the bat; and (c) the 1% was a quick calculation, not directly set up to answer the question at hand.
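Just to lay the arithmetic out in one place, here is a minimal back-of-the-envelope sketch in Python. All the inputs are the rough numbers quoted above (the 1/300 density drop per degree, the 15% Denver figure, Nathan’s 20 feet and 2.3% per foot); nothing here comes from Callahan et al.’s actual analysis.

# Back-of-the-envelope version of the calculation above (a sketch, not the
# paper's analysis). Inputs are the rough figures quoted in this post.
density_drop_per_degC = 1 / 300      # ideal-gas approximation at roughly 300 K
denver_density_drop = 0.15           # rough "Denver vs. sea level" figure
coors_extra_carry_ft = 20            # about 5% of a 400 ft fly ball
hr_prob_change_per_ft = 0.023        # Nathan's estimate per foot of carry

# Fraction of the "Denver effect" produced by 1 degree C of warming (~1/45):
fraction_of_denver = density_drop_per_degC / denver_density_drop

extra_carry_per_degC = fraction_of_denver * coors_extra_carry_ft   # ~0.4 ft
hr_increase_per_degC = extra_carry_per_degC * hr_prob_change_per_ft

print(f"extra carry per deg C: {extra_carry_per_degC:.2f} ft")
print(f"home-run increase per deg C: {hr_increase_per_degC:.1%}")
# Comes out to roughly 1%, compared to the ~2% in Callahan et al. and
# Nathan's 1.8%.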

And . . . going to Nathan’s site, I see he has an updated article on the effect of temperature on home run production, responding to the new paper by Callahan et al. He writes that in 2017 he estimated that a 1 degree C increase in temperature “results in 1.8% more home runs.” Nathan’s 2017 paper did this sort of thing:

I don’t like the double y-axis, but my real point here is just that he was using actual trajectory data to get a sense of how many balls were in the window of being possibly affected by a small rise in distance traveled.

Callahan et al. don’t actually refer to Nathan’s 2017 paper or the corresponding 1.8% estimate, which is too bad because that would’ve made their paper much stronger! Callahan et al. run some regressions, which is fine, but I find the analysis based on physics and ball trajectories much more convincing. And I find the combination of analyses even more convincing. Unfortunately, Callahan et al. didn’t do as much Googling as they should’ve, so they didn’t have access to that earlier analysis! In his new article, Nathan does further analysis and continues to estimate that a 1 degree C increase in temperature results in 1.8% more home runs.

So, perhaps surprisingly, our correspondent’s intuition was wrong: a small change in air density really can have a noticeable effect here. In another way, though, she’s kinda right, in that the effects of warming are only a small part of what is happening in baseball.

Relevance to global warming

The home run example is kinda goofy, but, believe it or not, I do think it’s relevant to more general concerns about global warming. Not because I care about the sanctity of baseball—if you get too many home runs, just un-juice the ball, or reduce the length of the game to 8 innings, or make them swing 50-ounce bats, or whatever—but because it illustrates how a small average change can make a big change on the margin. In this case, it’s all those balls that are close to the fence but don’t quite make it over. The ball going 5% farther corresponds to a lot more than a 5% increase in home runs.

Elasticities are typically between 0 and 1, so it’s interesting to see an example where the elasticity is much greater than 1. In the baseball case, I guess that one reason there are so many fly balls within 20 feet of being home runs is that batters are trying so hard to hit the ball over the fence, and they often come close without succeeding. The analogy to environmental problems is that much of agriculture and planning is on the edge in some way—using all the resources currently available, building right up to the coast, etc.—so that even small changes in the climate can have big effects.
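Here’s a quick simulation sketch of that elasticity point. The distance distribution is entirely made up (I’m not fitting real batted-ball data), but it shows how a 1% change in carry can translate into a much larger percentage change in home runs when lots of balls land just short of the fence.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fly-ball carry distances in feet; this distribution is made
# up purely to illustrate the marginal-effect point, not fit to real data.
distances = rng.normal(loc=300, scale=60, size=1_000_000)
fence = 380  # stylized fence distance

hr_before = np.mean(distances > fence)
hr_after = np.mean(distances * 1.01 > fence)  # every ball carries 1% farther

elasticity = ((hr_after - hr_before) / hr_before) / 0.01
print(f"home-run rate: {hr_before:.3f} -> {hr_after:.3f}")
print(f"elasticity: about {elasticity:.0f}")
# A 1% increase in carry gives a much-bigger-than-1% increase in home runs,
# because so many fly balls land just short of the fence.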

I’m not saying the baseball analysis proves any of this, just that it’s a good example of the general point, an example we can all understand by thinking about those batted balls (a point that is somewhat lost in the statistical analysis in the above-linked paper).

How to digest research claims? (1) vitamin D and covid; (2) fish oil and cancer

I happened to receive two emails on the same day on two different topics, both relating to how much to trust claims published in the medical literature.

1. Someone writes:

This is the follow-up publication for the paper that was retracted from preprint servers a few months ago. The language has changed but the results are the same: patients treated with calcifediol had a much lower mortality rate than patients who were not treated:

This follows three other papers on the same therapy which found the same results:

Small pilot RCT
Large propensity matched study
Cohort trial of 574 patients

I continue to be bewildered that this therapy has been ignored given that it’s so safe with such a high upside.

This led me to an interesting question which I thought you may have an answer for: “What are the most costly Type II errors in history?”

2. Someone else writes:

Do you think these two studies are flawed?

Serum Phospholipid Fatty Acids and Prostate Cancer Risk: Results From the Prostate Cancer Prevention Trial
Plasma Phospholipid Fatty Acids and Prostate Cancer Risk in the SELECT Trial

I said that I don’t know; I’ve never heard of this topic before. Why do you think they might be flawed?

And my correspondent replied:

I don’t understand the nested case cohort design but a very senior presenter at our Grand Rounds mentioned the studies were flawed. He didn’t go into the details as his topic was entirely different. I am trying to understand whether fish oil leads to increased risk for prostate cancer. I take fish oil myself but these studies shake my confidence, although they may be flawed studies.

I have no idea what to think about any of these papers. The medical literature is so huge that it often seems hopeless to interpret any single article or even subliterature.

An alternative approach is to look for trusted sources on the internet, but that’s not always so helpful either. For example, when I google *cleveland clinic vitamin d covid*, the first hit is an article, Can Vitamin D Prevent COVID-19?, which sounds relevant but then I notice that the date is 18 May 2020. Lots has been learned about covid since then, no?? I’m not trying to slam the Cleveland Clinic here, just saying that it’s hard to know where to look. I trust my doctor, which is fine, but (a) not everyone has a primary care doctor, and (b) in any case, doctors need to get their information from somewhere too.

I don’t know what is currently considered the best way to summarize the state of medical knowledge on any given topic.

P.S. Just to clarify one point: In the above post I’m not saying that the answers to these medical questions are unknowable, or even that nobody knows the answers. I can well believe there are some people who have a clear sense of what’s going on here. I’m just saying that I have no idea what to think about these papers. So I appreciate the feedback in the comments section.

Association between low density lipoprotein cholesterol and all-cause mortality

Larry Gonick asks what I think of this research article, Association between low density lipoprotein cholesterol and all-cause mortality: results from the NHANES 1999–2014.

The topic is relevant to me, as I’ve had cholesterol issues. And here’s a stunning bit from the abstract:

We used the 1999–2014 National Health and Nutrition Examination Survey (NHANES) data with 19,034 people to assess the association between LDL-C level and all-cause mortality. . . . In the age-adjusted model (model 1), it was found that the lowest LDL-C group had a higher risk of all-cause mortality (HR 1.7 [1.4–2.1]) than LDL-C 100–129 mg/dL as a reference group. The crude-adjusted model (model 2) suggests that people with the lowest level of LDL-C had 1.6 (95% CI [1.3–1.9]) times the odds compared with the reference group, after adjusting for age, sex, race, marital status, education level, smoking status, body mass index (BMI). In the fully-adjusted model (model 3), people with the lowest level of LDL-C had 1.4 (95% CI [1.1–1.7]) times the odds compared with the reference group, after additionally adjusting for hypertension, diabetes, cardiovascular disease, cancer based on model 2. . . . In conclusion, we found that low level of LDL-C is associated with higher risk of all-cause mortality.

The above quotation is exact except that I rounded all numbers to one decimal place. The original version presented them to three decimals (“1.708,” etc.) and that made me cry.

In any case, the finding surprised me. I don’t know that it’s actually a medical surprise; I just had the general impression that cholesterol is a bad thing to have. Also, I was gonna say I was surprised that the estimated effects were so large, but then I saw the large widths of the confidence intervals, and that surprised me too at first, but then I realized that not so many people in the longitudinal study would have died during the period, so the effective sample size isn’t quite as large as it might seem at first.
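To see the effective-sample-size point numerically: for a log hazard ratio, a rough rule of thumb is that the variance is about 1/d1 + 1/d0, where d1 and d0 are the numbers of deaths in the two groups being compared. The event counts in the sketch below are hypothetical, chosen only to show the scale; they are not taken from the paper. The point is just that a few hundred deaths per comparison group gives an interval about as wide as the ones quoted above, no matter how many thousands of people were enrolled.

import math

# Rough rule of thumb: var(log hazard ratio) ~ 1/d1 + 1/d0, where d1 and d0
# are the numbers of deaths in the two groups. The counts below are
# hypothetical, chosen only to show the scale of the resulting interval;
# they are not taken from the NHANES paper.

def hr_ci(hr, d1, d0, z=1.96):
    se = math.sqrt(1 / d1 + 1 / d0)
    return hr * math.exp(-z * se), hr * math.exp(z * se)

lo, hi = hr_ci(hr=1.4, d1=120, d0=450)
print(f"HR 1.4, 95% CI ({lo:.1f}, {hi:.1f})")
# Even with ~19,000 people enrolled, it's the number of deaths in each group
# that determines the width of the interval.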

The researchers also fit some curves:

Next, the inferences that the curve came from:

The data are consistent with high risks at low cholesterol levels and nothing happening at high levels, but they’re also consistent with other patterns, as can be seen from the uncertainty lines.

The published paper does a good job of presenting data and conclusions clearly without any overclaiming that I can see.

Anyway, I don’t really know what to make of this study, and I know nothing about the literature in the area. I’ll still go by my usual algorithm and just trust my doctor on everything.

I’m posting because (a) I just think it’s cool that the author of the Cartoon Guide to Statistics reads our blog, and (b) it can be helpful to our readers to see an example of my ignorance.

“Risk ratio, odds ratio, risk difference… Which causal measure is easier to generalize?”

Anders Huitfeldt writes:

Thank you so much for discussing my preprint on effect measures (“Count the living or the dead?”) on your blog! I really appreciate getting as many eyes as possible on this work; having it highlighted by you is the kind of thing that can really make the snowball start rolling towards getting a second chance in academia. (I am currently working as a second-year resident in addiction medicine, after exhausting my academic opportunities.)

I just wanted to highlight a preprint that was released today by Bénédicte Colnet, Julie Josse, Gaël Varoquaux, and Erwan Scornet. To me, this preprint looks like it might become an instant classic. Colnet and her coauthors generalize my thought process, and present it with much more elegance and sophistication. It is almost something I might have written if I had an additional standard deviation in IQ, and if I were trained in biostatistics instead of epidemiology.

The article in question begins:

From the physician to the patient, the term effect of a drug on an outcome usually appears very spontaneously, within a casual discussion or in scientific documents. Overall, everyone agrees that an effect is a comparison between two states: treated or not. But there are various ways to report the main effect of a treatment. For example, the scale may be absolute (e.g. the number of migraine days per month is expected to diminish by 0.8 when taking Rimegepant) or relative (e.g. the probability of having a thrombosis is expected to be multiplied by 3.8 when taking oral contraceptives). Choosing one measure or the other has several consequences. First, it conveys a different impression of the same data to an external reader. . . . Second, the treatment effect heterogeneity – i.e. different effects on sub-populations – depends on the chosen measure. . . .

Beyond impression conveyed and heterogeneity captured, different causal measures lead to different generalizability towards populations. . . . Generalizability of trials’ findings is crucial as most often clinicians use causal effects from published trials (i) to estimate the expected response to treatment for a specific patient . . .

This is indeed important, and it relates to things that people have been thinking about for a while now regarding varying treatment effects. Colnet et al. point out that, even if effects are constant on one scale, they will vary on other scales. In some sense, this hardly matters, given that we can expect effects to vary on any scale. Different scales correspond to different default interpretations, which fits the idea that the choice of transformation is as much a matter of communication as of modeling. In practice, though, we use default model classes, and so the parameterization can make a difference.
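Here’s a tiny numerical illustration of that point, with made-up numbers rather than anything from Colnet et al.: suppose the treatment exactly halves the risk in every subgroup, so the risk ratio is constant at 0.5. The risk difference and the odds ratio then necessarily differ across subgroups.

# Hypothetical baseline risks in two subgroups; the treatment is assumed to
# exactly halve the risk in each (constant risk ratio of 0.5).
baseline = {"low-risk group": 0.10, "high-risk group": 0.40}
risk_ratio = 0.5

def odds(p):
    return p / (1 - p)

for name, p0 in baseline.items():
    p1 = risk_ratio * p0
    rd = p1 - p0
    or_ = odds(p1) / odds(p0)
    print(f"{name}: RR = {p1 / p0:.2f}, RD = {rd:+.2f}, OR = {or_:.2f}")

# Approximate output:
#   low-risk group:  RR = 0.50, RD = -0.05, OR = 0.47
#   high-risk group: RR = 0.50, RD = -0.20, OR = 0.38
# Constant on one scale, varying on the others.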

The new paper by Colnet et al. is potentially important because, as they point out, there remains a lot of confused thinking on the topic, both in theory and in practice, and I think part of the problem is a traditional setup in which there is a “treatment effect” to be estimated. In applied studies, you’ll often see this as a coefficient in a model. But, as Colnet et al. point out, if you take that coefficient as estimated from study A and use it to generalize to study B, you’ll be making some big assumptions. Better to get those assumptions out in the open and consider how the effect can vary.

As we discussed a few years ago, the average causal effect can be defined in any setting, but it can be misleading to think of it as a “parameter” to be estimated, as in general it can depend strongly on the context where it is being studied.

Finally, I’d like to again remind readers of our recent article, Causal quartets: Different ways to attain the same average treatment effect (blog discussion here), which discusses the many different ways that an average causal effect can manifest itself in the context of variation:

As Steve Stigler’s paper pointed out, there’s nothing necessarily “causal” about the content of our paper, or for that matter of the Colnet et al. paper. In both cases, all the causal language could be replaced by predictive language and the models and messages would be unchanged. Here is what we say in our article:

Nothing in this paper so far requires a causal connection. Instead of talking about heterogeneous treatment effects, we could just as well have referred to variation more generally. Why, then, are we putting this in a causal framework? Why “causal quartets” rather than “heterogeneity quartets”?

Most directly, we have seen the problem of unrecognized heterogeneity come up all the time in causal contexts, as in the examples in [our paper], and not so much elsewhere. We think a key reason is that the individual treatment effect is latent. So it’s not possible to make the “quartet” plots with raw data. Instead, it’s easy for researchers to simply assume the causal effect is constant, or to not think at all about heterogeneity of causal effects, in a way that’s harder to do with observable outcomes. It is the very impossibility of directly drawing the quartets that makes them valuable as conceptual tools.

So, yes, variation is everywhere, but in the causal setting, where at least half of the potential outcomes are unobserved, it’s easier for people to overlook variation or to use models where it isn’t there, such as the default model of a constant effect (on some scale or another).

It can be tempting to assume a constant effect, maybe because it’s simpler or maybe because you haven’t thought too much about it or maybe because you think that, in the absence of any direct data on individual causal effects, it’s safe to assume the effect doesn’t vary. But, for reasons discussed in the various articles above, assuming constant effects can be misleading in many different ways. I think it’s time to move off of that default.

What does it take, or should it take, for an empirical social science study to be convincing?

A frequent correspondent sends along a link to a recently published research article and writes:

I saw this paper on a social media site and it seems relevant given your post on the relative importance of social science research. At first, I thought it was an ingenious natural experiment, but the more I looked at it, the more questions I had. They sure put a lot of work into this, though, evidence of the subject’s importance.

I’m actually not sure how bad the work is, given that I haven’t spent much time with it. But the p-values are a bit overdone (understatement there). And, for all the p-values they provide, I thought it was interesting that they never mention the R-squared from any of the models. I appreciate the lack of information the R-squared would provide, but I am always interested to know if it is 0.05 or 0.70. Not a mention. They do, however, find fairly large effects – a bit too large to be believable, I think.

I didn’t have time to look into this one so I won’t actually link to the linked paper; instead I’ll give some general reactions.

There’s something about that sort of study that rubs me the wrong way and gives me skepticism, but, as my correspondent says, the topic is important so it makes sense to study it. My usual reaction to such studies is that I want to see the trail of breadcrumbs, starting from time series plots of local and aggregate data and leading to the conclusions. Just seeing the regression results isn’t enough for me, no matter how many robustness studies are attached to it. Again, this does not mean that the conclusions are wrong or even that there’s anything wrong with what the researchers are doing; I just think that the intermediate steps are required to be able to make sense of this sort of analysis of limited historical data.

Haemoglobin blogging

Gavin Band writes:

I wondered what you (or your readers) make of this. Some points that might be of interest:

– The effect we discover is massive (OR > 10).
– The number of data points supporting that estimate is not *that* large (Figure 2).
– It can be thought of as a sort of collider effect – (human and parasite genotypes affecting disease status, which we ascertain on) – though I haven’t figured out whether it’s really useful to think of it that way.
– It makes use of Stan! (Albeit only in a relatively minor way in Figure 2).

All in all it’s a pretty striking signal and I wondered what a stats audience would make of this – maybe it’s all convincing, or maybe there are things we’ve overlooked or could have done better? I’d certainly be interested in any thoughts…

The linked article is called “The protective effect of sickle cell haemoglobin against severe malaria depends on parasite genotype,” and I have nothing to say about it, as I’ve always found genetics to be very intimidating! But I’ll share with all of you.
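One way to see the collider/ascertainment point in Gavin’s message is with a small simulation. The frequencies and risks below are made up (they have nothing to do with the actual paper); the point is just that conditioning on disease status turns a genotype-by-genotype interaction in disease risk into an association between the two genotypes among ascertained cases, even though they are independent in the population.

import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# Two independent binary "genotypes": a protective human variant and a
# parasite variant. Frequencies and risks below are entirely made up.
human = rng.random(n) < 0.15
parasite = rng.random(n) < 0.30

# Severe disease depends on both; the protective effect of the human variant
# is assumed to be weaker against this parasite variant (an interaction).
p_disease = np.where(parasite,
                     np.where(human, 0.04, 0.05),
                     np.where(human, 0.01, 0.05))
disease = rng.random(n) < p_disease

def log_or(a, b):
    # Log odds ratio between two binary arrays.
    tab = np.array([[np.sum(a & b), np.sum(a & ~b)],
                    [np.sum(~a & b), np.sum(~a & ~b)]], dtype=float)
    return np.log(tab[0, 0] * tab[1, 1] / (tab[0, 1] * tab[1, 0]))

print("population log-OR(human, parasite):",
      round(log_or(human, parasite), 3))
print("among cases log-OR(human, parasite):",
      round(log_or(human[disease], parasite[disease]), 3))
# In the full population the two genotypes are (nearly) independent; among
# ascertained cases they are strongly associated, because disease status is
# a common effect of both.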

Reconciling evaluations of the Millennium Villages Project

Shira Mitchell, Jeff Sachs, Sonia Sachs, and I write:

The Millennium Villages Project was an integrated rural development program carried out for a decade in 10 clusters of villages in sub-Saharan Africa starting in 2005, and in a few other sites for shorter durations. An evaluation of the 10 main sites compared to retrospectively chosen control sites estimated positive effects on a range of economic, social, and health outcomes (Mitchell et al. 2018). More recently, an outside group performed a prospective controlled (but also nonrandomized) evaluation of one of the shorter-duration sites and reported smaller or null results (Masset et al. 2020). Although these two conclusions seem contradictory, the differences can be explained by the fact that Mitchell et al. studied 10 sites where the project was implemented for 10 years, and Masset et al. studied one site with a program lasting less than 5 years, as well as differences in inference and framing. Insights from both evaluations should be valuable in considering future development efforts of this sort. Both studies are consistent with a larger picture of positive average impacts (compared to untreated villages) across a broad range of outcomes, but with effects varying across sites or requiring an adequate duration for impacts to be manifested.

I like this paper because we put a real effort into understanding why two different attacks on the same problem reached such different conclusions. A challenge here was that one of the approaches being compared was our own! It’s hard to be objective about your own work, but we tried our best to step back and compare the approaches without taking sides.

Some background is here:

From 2015: Evaluating the Millennium Villages Project

From 2018: The Millennium Villages Project: a retrospective, observational, endline evaluation

Full credit to Shira for pushing all this through.