Their signal-to-noise ratio was low, so they decided to do a specification search, use a one-tailed test, and go with a p-value of 0.1.

Adam Zelizer writes:

I saw your post about the underpowered COVID survey experiment on the blog and wondered if you’ve seen this paper, “Counter-stereotypical Messaging and Partisan Cues: Moving the Needle on Vaccines in a Polarized U.S.” It is written by a strong team of economists and political scientists and finds large positive effects of Trump pro-vaccine messaging on vaccine uptake.

They find large positive effects of the messaging (administered through YouTube ads) on the number of vaccines administered at the county level—over 100 new vaccinations in treated counties—but only after changing their specification from the prespecified one in the PAP. The p-value from the main modified specification is only 0.097, from a one-tailed test, and the effect size from the modified specification is 10 times larger than what they get from the pre-specified model. The prespecified model finds that showing the Trump advertisement increased the number of vaccines administered in the average treated county by 10; the specification in the paper, and reported in the abstract, estimates 103 more vaccines. So moving from the specification in the PAP to the one in the paper doesn’t just improve precision, but it dramatically increases the estimated treatment effect. A good example of suppression effects.

They explain their logic for using the modified specification, but it smells like the garden of forking paths.

Here’s a snippet from the article:

I don’t have much to say about the forking paths except to give my usual advice to fit all reasonable specifications and use a hierarchical model, or at the very least do a multiverse analysis. No reason to think that the effect of this treatment should be zero, and if you really care about effect size you want to avoid obvious sources of bias such as model selection.
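As a concrete illustration of what a bare-bones multiverse analysis can look like, here is a minimal sketch in Python; the data frame, column names, and specification choices are hypothetical, not taken from the paper:

```python
# A minimal sketch of the multiverse idea: fit every reasonable specification
# and look at the whole distribution of estimates instead of reporting one.
# Column names (vaccinations, treated, baseline_rate, population) are
# hypothetical, just to illustrate the loop-over-specifications idea.
import itertools
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def multiverse(df: pd.DataFrame) -> pd.DataFrame:
    controls = ["baseline_rate", "np.log(population)"]
    results = []
    # Every subset of controls, with and without population weighting, is one "universe."
    for k in range(len(controls) + 1):
        for subset in itertools.combinations(controls, k):
            formula = "vaccinations ~ treated" + "".join(" + " + c for c in subset)
            for weighted in (False, True):
                if weighted:
                    model = smf.wls(formula, data=df, weights=df["population"])
                else:
                    model = smf.ols(formula, data=df)
                fit = model.fit()
                results.append({
                    "formula": formula,
                    "weighted": weighted,
                    "estimate": fit.params["treated"],
                    "se": fit.bse["treated"],
                })
    return pd.DataFrame(results)

if __name__ == "__main__":
    # Fake data, just so the sketch runs end to end.
    rng = np.random.default_rng(0)
    n = 200
    df = pd.DataFrame({
        "treated": rng.integers(0, 2, n),
        "baseline_rate": rng.normal(50, 10, n),
        "population": rng.integers(1_000, 100_000, n),
    })
    df["vaccinations"] = 10 * df["treated"] + 0.5 * df["baseline_rate"] + rng.normal(0, 20, n)
    print(multiverse(df))
```

If the headline number only shows up under one specification out of many, that is itself worth knowing; plotting or tabulating the full set of estimates makes that visible.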

The above bit about one-tailed tests reflects a common misunderstanding in social science. As I’ll keep saying until my lips bleed, effects are never zero. They’re large in some settings, small in others, sometimes positive, sometimes negative. From the perspective of the researchers, the idea of the hypothesis test is to give convincing evidence that the treatment truly has a positive average effect. That’s fine, and it’s addressed directly through estimation: the uncertainty interval gives you a sense of what the data can tell you here.

When they say they’re doing a one-tailed test and they’re cool with a p-value of 0.1 (that would be 0.2 when following the standard approach) because they have “low signal-to-noise ratios” . . . that’s just wack. Low signal-to-noise ratio implies high uncertainty in your conclusions. High uncertainty is fine! You can still recommend this policy be done in the midst of this uncertainty. After all, policymakers have to do something. To me, this one-sided testing and p-value thresholding thing just seems to be missing the point, in that it’s trying to squeeze out an expression of near-certainty from data that don’t admit such an interpretation.

P.S. I do not write this sort of post out of any sort of animosity toward the authors or toward their topic of research. I write about these methods issues because I care. Policy is important. I don’t think it is good for policy for researchers to use statistical methods that lead to overconfidence and inappropriate impressions of certainty or near-certainty. The goal of a statistical analysis should not be to attain statistical significance or to otherwise reach some sort of success point. It should be to learn what we can from our data and model, and to also get a sense of what we don’t know.

My NYU econ talk will be Thurs 18 Apr 12:30pm (NOT Thurs 7 Mar)

Hi all. The other day I announced a talk I’ll be giving at the NYU economics seminar. It will be Thurs 18 Apr 12:30pm at 19 West 4th St., room 517.

In my earlier post, I’d given the wrong day for the talk. I’d written that it was this Thurs, 7 Mar. That was wrong! Completely my fault here; I misread my own calendar.

So I hope nobody shows up to that room tomorrow! Thank you for your forbearance.

I hope to see youall on Thurs 18 Apr. Again, here’s the title and abstract:

How large is that treatment effect, really?

“Unbiased estimates” aren’t really unbiased, for a bunch of reasons, including aggregation, selection, extrapolation, and variation over time. Econometrics typically focuses on causal identification, with the goal of estimating “the” effect. But we typically care about individual effects (not “Does the treatment work?” but “Where and when does it work?” and “Where and when does it hurt?”). Estimating individual effects is relevant not only for individuals but also for generalizing to the population. For example, how do you generalize from an A/B test performed on a sample right now to possible effects on a different population in the future? Thinking about variation and generalization can change how we design and analyze experiments and observational studies. We demonstrate with examples in social science and public health.

How large is that treatment effect, really? (My talk at the NYU economics seminar, Thurs 18 Apr, not Thurs 7 Mar)

Thurs 18 Apr 2024 (not Thurs 7 Mar), 12:30pm at 19 West 4th St., room 517:

How large is that treatment effect, really?

“Unbiased estimates” aren’t really unbiased, for a bunch of reasons, including aggregation, selection, extrapolation, and variation over time. Econometrics typically focuses on causal identification, with the goal of estimating “the” effect. But we typically care about individual effects (not “Does the treatment work?” but “Where and when does it work?” and “Where and when does it hurt?”). Estimating individual effects is relevant not only for individuals but also for generalizing to the population. For example, how do you generalize from an A/B test performed on a sample right now to possible effects on a different population in the future? Thinking about variation and generalization can change how we design and analyze experiments and observational studies. We demonstrate with examples in social science and public health.

Our new book, Active Statistics, is now available!

Coauthored with Aki Vehtari, this new book is lots of fun, perhaps the funnest I’ve ever been involved in writing. And it’s stuffed full of statistical insights. The webpage for the book is here, and the link to buy it is here or directly from the publisher here.

With hundreds of stories, activities, and discussion problems on applied statistics and causal inference, this book is a perfect teaching aid, a perfect adjunct to a self-study program, and an enjoyable bedside read if you already know some statistics.

Here’s the quick summary:

This book provides statistics instructors and students with complete classroom material for a one- or two-semester course on applied regression and causal inference. It is built around 52 stories, 52 class-participation activities, 52 hands-on computer demonstrations, and 52 discussion problems that allow instructors and students to explore the real-world complexity of the subject. The book fosters an engaging “flipped classroom” environment with a focus on visualization and understanding. The book provides instructors with frameworks for self-study or for structuring the course, along with tips for maintaining student engagement at all levels, and practice exam questions to help guide learning. Designed to accompany the authors’ previous textbook Regression and Other Stories, its modular nature and wealth of material allow this book to be adapted to different courses and texts or be used by learners as a hands-on workbook.

It’s got 52 of everything because it’s structured around a two-semester class, with 13 weeks per semester and 2 classes per week. It’s really just bursting with material, including some classic stories and lots of completely new material. Right off the bat we present a statistical mystery that arose with a Wikipedia experiment, and we have a retelling of the famous Literary Digest survey story but with a new and unexpected twist (courtesy of Sharon Lohr and J. Michael Brick). And the activities have so much going on! One of my favorites is a version of the two truths and a lie game that demonstrates several different statistical ideas.

People have asked how this differs from my book with Deborah Nolan, Teaching Statistics: A Bag of Tricks. My quick answer is that Active Statistics has a lot more of everything, it’s structured to cover an entire two-semester course in order, and it’s focused on applied statistics, including a bunch of stories, activities, demonstrations, and problems on causal inference, a topic that is not always so well integrated into the statistics curriculum. You’re gonna love this book.

You can buy it here or here. It’s only 25 bucks, which is an excellent deal considering how stuffed it is with useful content. Enjoy.

When Steve Bannon meets the Center for Open Science: Bad science and bad reporting combine to yield another ovulation/voting disaster

The Kangaroo with a feather effect

A couple of faithful correspondents pointed me to this recent article, “Fertility Fails to Predict Voter Preference for the 2020 Election: A Pre-Registered Replication of Navarrete et al. (2010).”

It’s similar to other studies of ovulation and voting that we’ve criticized in the past (see for example pages 638-640 of this paper).

A few years ago I ran across the following recommendation for replication:

One way to put a stop to all this uncertainty: preregistration of studies of all kinds. It won’t quell existing worries, but it will help to prevent new ones, and eventually the truth will out.

My reaction was that this was way too optimistic. The ovulation-and-voting study had large measurement error, high levels of variation, and any underlying effects were small. And all this is made even worse because they were studying within-person effects using a between-person design. So any statistically significant difference they find is likely to be in the wrong direction and is essentially certain to be a huge overestimate. That is, the design has a high Type S error rate and a high Type M error rate.
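Here’s a minimal simulation of that point, with made-up numbers (a small true effect relative to the standard error), showing how conditioning on statistical significance produces sign errors and exaggerated magnitudes:

```python
# A minimal simulation of the type S / type M point: when the true effect is
# small relative to the standard error, the estimates that reach statistical
# significance are badly exaggerated and sometimes have the wrong sign.
# The numbers (true effect 0.1, standard error 1.0) are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
true_effect, se, n_sims = 0.1, 1.0, 100_000

estimates = rng.normal(true_effect, se, n_sims)
significant = np.abs(estimates / se) > 1.96

type_s = np.mean(estimates[significant] < 0)                           # wrong sign, given significance
exaggeration = np.mean(np.abs(estimates[significant])) / true_effect   # type M, exaggeration ratio

print(f"P(significant)     = {significant.mean():.3f}")
print(f"Type S error rate  = {type_s:.2f}")
print(f"Exaggeration ratio = {exaggeration:.1f}x")
```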

And, indeed, that’s what happened with the replication. It was a between-person comparison (that is, each person was surveyed at only one time point), there was no direct measurement of fertility, and this new study was powered to only be able to detect effects that were much larger than would be scientifically plausible.

The result: a pile of noise.

To the authors’ credit, their title leads right off with “Fertility Fails to Predict . . .” OK, not quite right, as they didn’t actually measure fertility, but at least they foregrounded their negative finding.

Bad Science

Is it fair for me to call this “bad science”? I think this description is fair. Let me emphasize that I’m not saying the authors of this study are bad people. Remember our principle that honesty and transparency are not enough. You can be of pure heart, but if you are studying a small and highly variable effect using a noisy design and crude measurement tools, you’re not going to learn anything useful. You might as well just be flipping coins or trying to find patterns in a table of random numbers. And that’s what’s going on here.

Indeed, this is one of the things that’s bothered me for years about preregistered replications. I love the idea of preregistration, and I love the idea of replication. These are useful tools for strengthening research that is potentially good research and for providing some perspective on questionable research that’s been done in the past. Even the mere prospect of preregistered replication can be a helpful conceptual tool when considering an existing literature or potential new studies.

But . . . if you take a hopelessly noisy design and preregister it, that doesn’t make it a good study. Put a pile of junk in a fancy suit and it’s still a pile of junk.

In some settings, I fear that “replication” is serving as a shiny object to distract people from the central issues of measurement, and I think that’s what’s going on here. The authors of this study were working with some vague ideas of evolutionary psychology, and they seem to be working under the assumption that, if you’re interested in theory X, the way to do science is to gather some data that have some indirect connection to X and then compute some statistical analysis in order to make an up-or-down decision (“statistically significant / not significant” or “replicated / not replicated”).

Again, that’s not enuf! Science isn’t just about theory, data, analysis, and conclusions. It’s also about measurement. It’s quantitative. And some measurements and designs are just too noisy to be useful.

As we wrote a few years ago,

My criticism of the ovulation-and-voting study is ultimately quantitative. Their effect size is tiny and their measurement error is huge. My best analogy is that they are trying to use a bathroom scale to weigh a feather—and the feather is resting loosely in the pouch of a kangaroo that is vigorously jumping up and down.

At some point, a set of measurements is so noisy that biases in selection and interpretation overwhelm any signal and, indeed, nothing useful can be learned from them. I assume that the underlying effect size in this case is not zero—if we were to look carefully, we would find some differences in political attitude at different times of the month for women, also different days of the week for men and for women, and different hours of the day, and I expect all these differences would interact with everything—not just marital status but also age, education, political attitudes, number of children, size of tax bill, etc etc. There’s an endless number of small effects, positive and negative, bubbling around.

Bad Reporting

Bad science is compounded by bad reporting. Someone pointed me to a website called “The National Pulse,” which labels itself as “radically independent” but seems to be an organ of the Trump wing of the Republican party, and which featured this story, which they seem to have picked up from the notorious sensationalist site, The Daily Mail:

STUDY: Women More Likely to Vote Trump During Most Fertile Point of Menstrual Cycle.

A new scientific study indicates women are more likely to vote for former President Donald Trump during the most fertile period of their menstrual cycle. According to researchers from the New School for Social Research, led by psychologist Jessica L Engelbrecht, women, when at their most fertile, are drawn to the former President’s intelligence in comparison to his political opponents. The research occurred between July and August 2020, observing 549 women to identify changes in their political opinions over time. . . .

A significant correlation was noticed between women at their most fertile and expressing positive opinions towards former President Donald Trump. . . . the 2020 study indicated that women, while ovulating, were drawn to former President Trump because of his high degree of intelligence, not physical attractiveness. . . .

As I wrote above, I think that research study was bad, but, conditional on the bad design and measurement, its authors seem to have reported it honestly.

The news report adds new levels of distortion.

– The report states that the study observed women “to identify changes in their political opinions over time.” First, the study didn’t “observe” anyone; they conducted an online survey. Second, they didn’t identify any changes over time: the women in the study were surveyed only once!

– The report says something about “a significant correlation” and that “the study indicated that . . .” This surprised me, given that the paper itself was titled, “Fertility Fails to Predict Voter Preference for the 2020 Election.” How do you get from “fails to predict” to “a significant correlation”? I looked at the journal article and found the relevant bit:

Results of this analysis for all 14 matchups appear in Table 2. In contrast to the original study’s findings, only in the Trump-Obama matchup was there a significant relationship between conception risk and voting preference [r_pb (475) = −.106, p = .021] such that the probability of intending to vote for Donald J. Trump rose with conception risk.

Got it? They looked at 14 comparisons. Out of these, one of these was “statistically significant” at the 5% level. This is the kind of thing you’d expect to see from pure noise, or the mathematical equivalent, which is a study with noisy measurements of small and variable effects. The authors write, “however, it is possible that this is a Type I error, as it was the only significant result across the matchups we analyzed,” which I think is still too credulous a way to put it; a more accurate summary would be to say that the data are consistent with null effects, which is no surprise given the realistic possible sizes of any effects in this very underpowered study.
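As a rough back-of-the-envelope check (treating the 14 comparisons as independent, which they won’t exactly be), pure noise would produce at least one nominally significant result about half the time:

```python
# Back-of-the-envelope check: if all 14 comparisons were pure noise and roughly
# independent, how often would at least one clear the 5% threshold?
p_at_least_one = 1 - 0.95**14
print(f"{p_at_least_one:.2f}")  # about 0.51, i.e. a coin flip
```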

The authors of the journal article also write, “Several factors may account for the discrepancy between our [lack of replication of] the original results.” They go on for six paragraphs giving possible theories—but never once considering the possibility that the original studies and theirs were just too noisy to learn anything useful.

Look. I don’t mind a bit of storytelling: why not? Storytelling is fun, and it can be a good way to think about scientific hypotheses and their implications. The reason we do social science is because we’re interested in the social world; we’re not just number crunchers. So I don’t mind that the authors had several paragraphs with stories. The problem is not that they’re telling stories, it’s that they’re only telling stories. They don’t ever reflect that this entire literature is chasing patterns in noise.

And this lack of reflection about measurement and effect size is destroying them! They went to all this trouble to replicate this old study, without ever grappling with that study’s fundamental flaw (see kangaroo picture at the top of this post). Again, I’m not saying that the authors are bad people or that they intend to mislead; they’re just doing bad, 2010-2015-era psychological science. They don’t know better, and they haven’t been well served by the academic psychology establishment, which has promoted and continues to promote this sort of junk science.

Don’t blame the authors of the bad study for the terrible distorted reporting

Finally, it’s not the authors’ fault that their study was misreported by the Daily Mail and that Steve Bannon-associated website. “Fails to Predict” is right there in the title of the journal article. If clickbait websites and political propagandists want to pull out that p = 0.02 result from your 14 comparisons and spin a tale around it, you can’t really stop them.

The Center for Open Science!

Science reform buffs will enjoy these final bits from the published paper:

Clinical trials that are designed to fail

Mark Palko points us to a recent update by Robert Yeh et al. of the famous randomized parachute-jumping trial:

Palko writes:

I also love the way they dot all the i’s and cross all the t’s. The whole thing is played absolutely straight.

I recently came across another (not meant as satire) study where the raw data was complete crap but the authors had this ridiculously detailed methods section, as if throwing in a graduate level stats course worth of terminology would somehow spin this shitty straw into gold.

Yeh et al. conclude:

This reminded me of my zombies paper. I forwarded the discussion to Kaiser Fung, who wrote:

Another recent example from Covid is this Scottish study. They did so much to the data that it is impossible for any reader to judge whether they did the right things or not. The data are all locked down for “privacy.”

Getting back to the original topic, Joseph Delaney had some thoughts:

I think the parachute study makes a good and widely misunderstood point. Our randomized controlled trial infrastructure is designed for the drug development world, where there is a huge (literally life altering) benefit to proving the efficacy of a new agent. Conservative errors are being cautious and nobody seriously considers a trial designed to fail as a plausible scenario.

But you see new issues with trials designed to find side effects (e.g., RECORD has a lot more LTFU [loss to follow-up] than I saw in a drug study, when I did trials we studied how to improve adherence to improve the results—but a trial looking for side effects that cost the company money would do the reverse). We teach in pharmacy that conservative design is actually a problem in safety trials.

Even worse are trials which are aliased with a political agenda. It’s easy-peasy to design a trial to fail (the parachute trial was jumping from a height of 2 feet). That makes me a lot more critical when you see trials where the failure of the trial would be seen as an upside, because it is just so easy to botch a trial. Designing good trials is very hard (smarter people than I spend entire careers doing a handful of them). It’s a tough issue.

Lots to chew on here.

If school funding doesn’t really matter, why do people want their kid’s school to be well funded?

A question came up about the effects of school funding and student performance, and we were referred to this review article from a few years ago by Larry Hedges, Terri Pigott, Joshua Polanin, Ann Marie Ryan, Charles Tocci, and Ryan Williams:

One question posed continually over the past century of education research is to what extent school resources affect student outcomes. From the turn of the century to the present, a diverse set of actors, including politicians, physicians, and researchers from a number of disciplines, have studied whether and how money that is provided for schools translates into increased student achievement. The authors discuss the historical origins of the question of whether school resources relate to student achievement, and report the results of a meta-analysis of studies examining that relationship. They find that policymakers, researchers, and other stakeholders have addressed this question using diverse strategies. The way the question is asked, and the methods used to answer it, is shaped by history, as well as by the scholarly, social, and political concerns of any given time. The diversity of methods has resulted in a body of literature too diverse and too inconsistent to yield reliable inferences through meta-analysis. The authors suggest that a collaborative approach addressing the question from a variety of disciplinary and practice perspectives may lead to more effective interventions to meet the needs of all students.

I haven’t followed this literature carefully. It was my vague impression that studies have found effects of schools on students’ test scores to be small. So, not clear that improving schools will do very much. On the other hand, everyone wants their kid to go to a good school. Just for example, all the people who go around saying that school funding doesn’t matter, they don’t ask to reduce the funding of their own kids’ schools. And I teach at an expensive school myself. So lots of pieces here, hard for me to put together.

I asked education statistics expert Beth Tipton what she thought, and she wrote:

I think the effect of money depends upon the educational context. For example, in higher education at selective universities, the selection process itself is what ensures success of students – the school matters far less. But in K-12, and particularly in under-resourced areas, schools and finances can matter a lot – thus the focus on charter schools in urban locales.

I guess the problem here is that I’m acting like the typical uninformed consumer of research. The world is complicated, and any literature will be a mess, full of claims and counter-claims, but here I am expecting there to be a simple coherent story that I can summarize in a short sentence (“Schools matter” or “Schools don’t matter” or, maybe, “Schools matter but only a little”).

Given how frustrated I get when others come into a topic with this attitude, I guess it’s good for me to recognize when I do it.

“Replicability & Generalisability”: Applying a discount factor to cost-effectiveness estimates.

This one’s important.

Matt Lerner points us to this report by Rosie Bettle, Replicability & Generalisability: A Guide to CEA discounts.

“CEA” is cost-effectiveness analysis, and by “discounts” they mean what we’ve called the Edlin factor. “Discount” is a better name than “factor,” because it’s a number between 0 and 1: the quantity you should multiply a point estimate by to adjust for inevitable upward biases in reported effect-size estimates, issues discussed here and here, for example.

It’s pleasant to see some of my ideas being used for a practical purpose. I would just add that type M and type S errors should be lower for Bayesian inferences than for raw inferences that have not been partially pooled toward a reasonable prior model.
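To make the discount idea concrete, here is a minimal normal-normal shrinkage sketch with made-up numbers; the prior stands in for whatever external information you have about plausible effect sizes, and the ratio of the shrunken estimate to the reported one is the implied discount:

```python
# A minimal sketch of the discount idea: shrink a reported estimate toward zero
# before plugging it into a cost-effectiveness calculation. The prior sd and
# the example numbers are made up for illustration.
import numpy as np

def shrunken_estimate(estimate: float, se: float, prior_sd: float) -> tuple[float, float]:
    """Posterior mean and sd under a normal(0, prior_sd) prior and normal likelihood."""
    prior_prec, data_prec = 1 / prior_sd**2, 1 / se**2
    post_var = 1 / (prior_prec + data_prec)
    post_mean = post_var * data_prec * estimate
    return post_mean, np.sqrt(post_var)

reported, se = 100.0, 60.0  # a noisy estimate, roughly the p-close-to-0.1 regime
post_mean, post_sd = shrunken_estimate(reported, se, prior_sd=25.0)
implied_discount = post_mean / reported
print(f"shrunken estimate = {post_mean:.0f}, implied discount = {implied_discount:.2f}")
```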

Also, regarding empirical estimation of adjustment factors, I recommend looking at the work of Erik van Zwet et al; here are some links:
What’s a good default prior for regression coefficients? A default Edlin factor of 1/2?
How large is the underlying coefficient? An application of the Edlin factor to that claim that “Cash Aid to Poor Mothers Increases Brain Activity in Babies”
The Shrinkage Trilogy: How to be Bayesian when analyzing simple experiments
Erik van Zwet explains the Shrinkage Trilogy
The significance filter, the winner’s curse and the need to shrink
Bayesians moving from defense to offense: “I really think it’s kind of irresponsible now not to use the information from all those thousands of medical trials that came before. Is that very radical?”
Explaining that line, “Bayesians moving from defense to offense”

I’m excited about the application of these ideas to policy analysis.

Michael Wiebe has several new replications written up on his site.

Michael Wiebe writes:

I have several new replications written up on my site.

Moretti (2021) studies whether larger cities drive more innovation, but I find that the event study and instrumental variable results are due to coding errors. This means that the main OLS results should not be interpreted causally.

Atwood (2022) studies the long-term economic effects of the measles vaccine. I run an event study and find that the results are explained by trends, instead of a treatment effect of the vaccine.

I [Wiebe] am also launching a Patreon, so that I can work on replications full-time.

Interesting. We’ve discussed some of Wiebe’s investigations and questions in the past; see here, here, here, and here (on the topics of promotion in China, election forecasting, historical patents, and forking paths, respectively). So, good to hear that he’s still at it!

Progress in 2023, Jessica Edition

Since Aki and Andrew are doing it… 

Published:

Unpublished/Preprints:

Performed:

If I had to choose a favorite (beyond the play, of course) it would be the rational agent benchmark paper, discussed here. But I also really like the causal quartets paper. The first aims to increase what we learn from experiments in empirical visualization and HCI through comparison to decision-theoretic benchmarks. The second aims to get people to think twice about what they’ve learned from an average treatment effect. Both have influenced what I’ve worked on since.

A feedback loop can destroy correlation: This idea comes up in many places.

The people who go by “Slime Mold Time Mold” write:

Some people have noted that not only does correlation not imply causality, no correlation also doesn’t imply no causality. Two variables can be causally linked without having an observable correlation. Two examples of people noting this previously are Nick Rowe offering the example of Milton Friedman’s thermostat and Scott Cunningham’s Do Not Confuse Correlation with Causality chapter in Causal Inference: The Mixtape.

We realized that this should be true for any control system or negative feedback loop. As long as the control of a variable is sufficiently effective, that variable won’t be correlated with the variables causally prior to it. We wrote a short blog post exploring this idea if you want to take a closer look. It appears to us that in any sufficiently effective control system, causally linked variables won’t be correlated. This puts some limitations on using correlational techniques to study anything that involves control systems, like the economy, or the human body. The stronger version of this observation, that the only case where causally linked variables aren’t correlated is when they are linked together as part of a control system, may also be true.

Our question for you is, has anyone else made this observation? Is it recognized within statistics? (Maybe this is all implied by Peston’s 1972 “The Correlation between Targets and Instruments”? But that paper seems totally focused on economics and has only 14 citations. And the two examples we give above are both economists.) If not, is it worth trying to give this some kind of formal treatment or taking other steps to bring this to people’s attention, and if so, what would those steps look like?

My response: Yes, this has come up before. It’s a subtle point, as can be seen in some of the confused comments to this post. In that example, the person who brought up the feedback-destroys-correlation example was economist Rachael Meager, and it was a psychologist, a law professor, and some dude who describes himself as “a professor, writer and keynote speaker specializing in the quality of strategic thinking and the design of decision processes” who missed the point. So it’s interesting that you brought up an example of feedback from the economics literature.
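Here’s a minimal simulation of the thermostat version of the point, with made-up numbers and an idealized controller: the furnace causally drives the indoor temperature, yet the two end up nearly uncorrelated because the controller is offsetting the outdoor disturbance:

```python
# A minimal simulation of the Friedman-thermostat point: the furnace causally
# drives the indoor temperature, but because the controller offsets the outdoor
# disturbance, furnace output and indoor temperature are nearly uncorrelated.
import numpy as np

rng = np.random.default_rng(1)
n, target = 10_000, 20.0

outdoor = rng.normal(5.0, 8.0, n)                     # disturbance
furnace = (target - outdoor) + rng.normal(0, 0.5, n)  # controller: offsets the disturbance
indoor = outdoor + furnace + rng.normal(0, 0.5, n)    # causal effect of furnace on indoor temp

print(np.corrcoef(furnace, indoor)[0, 1])   # near zero despite the causal link
print(np.corrcoef(furnace, outdoor)[0, 1])  # near -1: the furnace tracks the disturbance
```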

Also, as I like to say, correlation does not even imply correlation.

The point you are making about feedback is related to the idea that, at equilibrium in an idealized setting, price elasticity of demand should be -1, because if it’s higher or lower than that, it would make sense to alter the price accordingly and slide up or down that curve to maximize total $.
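For reference, here’s the one-line version of that argument in the idealized case of constant-elasticity demand with costs ignored:

```latex
% Revenue maximization with constant-elasticity demand, costs ignored:
q(p) = A p^{\varepsilon}, \qquad
R(p) = p\,q(p) = A p^{1+\varepsilon}, \qquad
\frac{dR}{dp} = A(1+\varepsilon)\,p^{\varepsilon} = 0
\;\Longrightarrow\; \varepsilon = -1 .
```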

I’m not up on all this literature; it’s the kind of thing that people were writing about a lot back in the 1950s related to cybernetics. It’s also related to the idea that clinical trials exist on a phase transition where the new treatment exists but has not yet been determined to be better or worse than the old. This is sometimes referred to as “equipoise,” which I consider to be a very sloppy concept.

The other thing is that everybody knows how correlations can be changed by selection (Simpson’s paradox, the example of high school grades and SAT scores among students who attend a moderately selective institution, those holes in the airplane wings, etc etc.). Knowing about one mechanism for correlations to be distorted can perhaps make people less attuned to other mechanisms such as the feedback thing.

So, yeah, a lot going on here.

Prediction isn’t everything, but everything is prediction


This is Leo.

Explanation or explanatory modeling can be considered to be the use of statistical models for testing causal hypotheses or associations, e.g. between a set of covariates and a response variable. Prediction or predictive modeling, on the other hand (supposedly), is the act of using a model—or device, or algorithm—to produce values of new, existing, or future observations. A lot has been written about the similarities and differences between explanation and prediction, for example Breiman (2001), Shmueli (2010), Billheimer (2019), and many more.

These are often thought to be separate dimensions of statistics, but Jonah and I have been discussing for a long time that in some sense there may actually be no such thing as explanation without prediction. Basically, although prediction itself is not the only goal of inferential statistics, everything—every objective—in inferential statistics can be reframed through the lens of prediction.

Hypothesis testing, ability estimation, hierarchical modeling, treatment effect estimation, causal inference problems, etc., can all be described in our opinion from an (inferential) predictive perspective. So far we have not found an example for which there is no way to reframe it as a prediction problem. So I ask you: is there any inferential statistical ambition that cannot be described in predictive terms?
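Here’s a toy version of the reframing, using a conjugate normal model and simulated data: the usual “is there a treatment effect?” question becomes a statement about predicted observables, namely how often a new treated unit is predicted to outperform a new control unit:

```python
# A minimal sketch of the reframing: instead of asking "what is the treatment
# effect parameter?", ask "what do we predict for a new treated unit versus a
# new control unit?" Conjugate normal model with known sd, made-up data.
import numpy as np

rng = np.random.default_rng(2)
sigma = 1.0                       # assumed known outcome sd
y_t = rng.normal(0.3, sigma, 50)  # treated group (simulated data)
y_c = rng.normal(0.0, sigma, 50)  # control group (simulated data)

def posterior(y, prior_mean=0.0, prior_sd=1.0):
    """Posterior mean and sd for a normal mean, normal prior, known sigma."""
    prec = 1 / prior_sd**2 + len(y) / sigma**2
    mean = (prior_mean / prior_sd**2 + y.sum() / sigma**2) / prec
    return mean, np.sqrt(1 / prec)

mu_t, sd_t = posterior(y_t)
mu_c, sd_c = posterior(y_c)

# Posterior predictive draws for one new unit under each condition:
n_draws = 100_000
y_new_t = rng.normal(mu_t, np.sqrt(sd_t**2 + sigma**2), n_draws)
y_new_c = rng.normal(mu_c, np.sqrt(sd_c**2 + sigma**2), n_draws)

print("P(new treated unit beats new control unit) =", np.mean(y_new_t > y_new_c))
```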

P.S. Like Billheimer (2019) and others, we think that inferential statistics should be considered as inherently predictive and be focused primarily on probabilistic predictions of observable events and quantities, rather than focusing on statistical estimates of unobservable parameters that do not exist outside of our highly contrived models. Similarly, we also feel that the goal of Bayesian modeling should not be taught to students as finding the posterior distribution of unobservables, but rather as finding the posterior predictive distribution of the observables (with finding the posterior as an intermediate step); even when we don’t only care about predictive accuracy and we still care about understanding how a model works (model checking, GoF measures), we think the predictive modeling interpretation is generally more intuitive and effective.

What to trust in the newspaper? Example of “The Simple Nudge That Raised Median Donations by 80%”

Greg Mayer points to this news article, “The Simple Nudge That Raised Median Donations by 80%,” which states:

A start-up used the Hebrew word “chai” and its numerical match, 18, to bump up giving amounts. . . . It’s a common donation amount among Jews — $18, $180, $1,800 or even $36 and other multiples.

So Daffy lowered its minimum gift to $18 and then went further, prompting any donor giving to any Jewish charity to bump gifts up by some related amount. Within a year, median gifts had risen to $180 from $100. . . .

I see several warning signs here:

1. “Within a year, median gifts had risen to $180 from $100.” This is a before/after change, not a direct comparison of outcomes.

2. No report, just a quoted number which could easily have been made up. Yes, the numbers in a report can be fabricated too, but that takes more work and is more risk. Making up numbers when talking with a reporter, that’s easy.

3. The people who report the number are motivated to claim success; the reporter is motivated to report a success. The article is filled with promotion for this company. It’s a short article that mentions “Daffy” 6 times, including this bit, which reads like a straight-up ad:

If you have children, grandchildren, nieces or nephews, there’s another possibility. Daffy has a family plan that allows children to prompt their adult relatives to support a cause the children choose. Why not put the app on their iPhones or iPads so they can make suggestions and let, for example, a 12-year-old make $12 donations to 12 nonprofits each year?

Why not, indeed? Even better, why not have them make their donations directly to Daffy and cut out the middleman?? Look, I’m not saying that the people behind Daffy are doing anything wrong; it’s just that this is public relations, not journalism.

4. Use of the word “nudge” in the headline is consistent with business-press hype. Recall that “nudge” is a subfield whose proponents are well connected in the media and routinely make exaggerated claims.

So, yeah, an observational comparison with no documentation, in an article that’s more like an advertisement, that’s kinda sus. Not that the claim is definitely wrong, there’s just no good reason for us to take it seriously.

“Integrated Inferences: Causal Models for Qualitative and Mixed-Method Research”

Macartan Humphreys and Alan Jacobs write:

We are delighted to announce the publication of our new book, Integrated Inferences: Causal Models for Qualitative and Mixed-Method Research.

This book has been quite a few years in the making, but we are really happy with how it has turned out and hope you will find it useful for your research and your teaching.

Integrated Inferences provides an introduction to fundamental principles of causal inference and Bayesian updating and shows how these tools can be used to implement and justify inferences using within-case (process tracing) evidence, correlational patterns across many cases, or a mix of the two. The basic idea builds on work by Pearl, Gelman, and others. If we can represent theories graphically – as causal models – we can then update our beliefs about these models using Bayesian methods, and then draw inferences about populations or cases from different types of data. We also demonstrate how causal models can guide research design, informing choices about which cases, observations, and mixes of methods will be most useful for addressing any given question.

We’ve attached Chapter 1 in case you want to have a look. See also https://integrated-inferences.github.io/ for resources including a link to a full open access version of the book.

The book has an accompanying R package on CRAN: CausalQueries. The package can be used to make, update, and query causal models using quite simple syntax. We’d be delighted if you took it for a spin. There’s a draft guide to the package here (Tietz et al).

The only thing about the above note that confuses me is the “but” in “This book has been quite a few years in the making, but we are really happy with how it has turned out…” I assume that it took years because they were trying to make it just right, so it’s no surprise that that they’re happy with it, conditional on them finally being ready to release it. The “but” seems to imply some sort of discordance which I don’t see here.

This empirical paper has been cited 1616 times but I don’t find it convincing. There’s no single fatal flaw, but the evidence does not seem so clear. How to think about this sort of thing? What to do? First, accept that evidence might not all go in one direction. Second, make lots of graphs. Also, an amusing story about how this paper is getting cited nowadays.

1. When can we trust? How can we navigate social science with skepticism?

2. Why I’m not convinced by that Quebec child-care study

3. Nearly 20 years on

1. When can we trust? How can we navigate social science with skepticism?

The other day I happened to run across a post from 2016 that I think is still worth sharing.

Here’s the background. Someone pointed me to a paper making the claim that “Canada’s universal childcare hurt children and families. . . . the evidence suggests that children are worse off by measures ranging from aggression to motor and social skills to illness. We also uncover evidence that the new child care program led to more hostile, less consistent parenting, worse parental health, and lower‐quality parental relationships.”

I looked at the paper carefully and wasn’t convinced. In short, the evidence went in all sorts of different directions, and I felt that the authors had been trying too hard to fit it all into a consistent story. It’s not that the paper had fatal flaws—it was not at all in the category of horror classics such as the beauty-and-sex-ratio paper, the ESP paper, the himmicanes paper, the air-rage paper, the pizzagate papers, the ovulation-and-voting paper, the air-pollution-in-China paper, etc etc etc.—it just didn’t really add up to me.

The question then is, if a paper can appear in a top journal, have no single killer flaw but still not be convincing, can we trust anything at all in the social sciences? At what point does skepticism become nihilism? Must I invoke the Chestertonian principle on myself?

I don’t know.

What I do think is that the first step is to carefully assess the connection between published claims, the analysis that led to these claims, and the data used in the analysis. The above-discussed paper has a problem that I’ve seen a lot, which is an implicit assumption that all the evidence should go in the same direction, a compression of complexity which I think is related to the cognitive illusion that Tversky and Kahneman called “the law of small numbers.” The first step in climbing out of this sort of hole is to look at lots of things at once, rather than treating empirical results as a sort of big bowl of fruit where the researcher can just pick out the juiciest items and leave the rest behind.

2. Why I’m not convinced by that Quebec child-care study

Here’s what I wrote on that paper back in 2016:

Yesterday we discussed the difficulties of learning from a small, noisy experiment, in the context of a longitudinal study conducted in Jamaica where researchers reported that an early-childhood intervention program caused a 42%, or 25%, gain in later earnings. I expressed skepticism.

Today I want to talk about a paper making an opposite claim: “Canada’s universal childcare hurt children and families.”

I’m skeptical of this one too.

Here’s the background. I happened to mention the problems with the Jamaica study in a talk I gave recently at Google, and afterward Hal Varian pointed me to this summary by Les Picker of a recent research article:

In Universal Childcare, Maternal Labor Supply, and Family Well-Being (NBER Working Paper No. 11832), authors Michael Baker, Jonathan Gruber, and Kevin Milligan measure the implications of universal childcare by studying the effects of the Quebec Family Policy. Beginning in 1997, the Canadian province of Quebec extended full-time kindergarten to all 5-year olds and included the provision of childcare at an out-of-pocket price of $5 per day to all 4-year olds. This $5 per day policy was extended to all 3-year olds in 1998, all 2-year olds in 1999, and finally to all children younger than 2 years old in 2000.

(Nearly) free child care: that’s a big deal. And the gradual rollout gives researchers a chance to estimate the effects of the program by comparing children at each age, those who were and were not eligible for this program.

The summary continues:

The authors first find that there was an enormous rise in childcare use in response to these subsidies: childcare use rose by one-third over just a few years. About a third of this shift appears to arise from women who previously had informal arrangements moving into the formal (subsidized) sector, and there were also equally large shifts from family and friend-based child care to paid care. Correspondingly, there was a large rise in the labor supply of married women when this program was introduced.

That makes sense. As usual, we expect elasticities to be between 0 and 1.

But what about the kids?

Disturbingly, the authors report that children’s outcomes have worsened since the program was introduced along a variety of behavioral and health dimensions. The NLSCY contains a host of measures of child well being developed by social scientists, ranging from aggression and hyperactivity, to motor-social skills, to illness. Along virtually every one of these dimensions, children in Quebec see their outcomes deteriorate relative to children in the rest of the nation over this time period.

More specifically:

Their results imply that this policy resulted in a rise of anxiety of children exposed to this new program of between 60 percent and 150 percent, and a decline in motor/social skills of between 8 percent and 20 percent. These findings represent a sharp break from previous trends in Quebec and the rest of the nation, and there are no such effects found for older children who were not subject to this policy change.

Also:

The authors also find that families became more strained with the introduction of the program, as manifested in more hostile, less consistent parenting, worse adult mental health, and lower relationship satisfaction for mothers.

I just find all this hard to believe. A doubling of anxiety? A decline in motor/social skills? Are these day care centers really that horrible? I guess it’s possible that the kids are ruining their health by giving each other colds (“There is a significant negative effect on the odds of being in excellent health of 5.3 percentage points.”)—but of course I’ve also heard the opposite, that it’s better to give your immune system a workout than to be preserved in a bubble. They also report “a policy effect on the treated of 155.8% to 394.6%” in the rate of nose/throat infection.

OK, here’s the research article.

The authors seem to be considering three situations: “childcare,” “informal childcare,” and “no childcare.” But I don’t understand how these are defined. Every child is cared for in some way, right? It’s not like the kid’s just sitting out on the street. So I’d assume that “no childcare” is actually informal childcare: mostly care by mom, dad, sibs, grandparents, etc. But then what do they mean by the category “informal childcare”? If parents are trading off taking care of the kid, does this count as informal childcare or no childcare? I find it hard to follow exactly what is going on in the paper, starting with the descriptive statistics, because I’m not quite sure what they’re talking about.

I think what’s needed here is some more comprehensive organization of the results. For example, consider this paragraph:

The results for 6-11 year olds, who were less affected by this policy change (but not unaffected due to the subsidization of after-school care) are in the third column of Table 4. They are largely consistent with a causal interpretation of the estimates. For three of the six measures for which data on 6-11 year olds is available (hyperactivity, aggressiveness and injury) the estimates are wrong-signed, and the estimate for injuries is statistically significant. For excellent health, there is also a negative effect on 6-11 year olds, but it is much smaller than the effect on 0-4 year olds. For anxiety, however, there is a significant and large effect on 6-11 year olds which is of similar magnitude as the result for 0-4 year olds.

The first sentence of the above excerpt has a cover-all-bases kind of feeling: if results are similar for 6-11 year olds as for 2-4 year olds, you can go with “but not unaffected”; if they differ, you can go with “less affected.” Various things are pulled out based on whether they are statistically significant, and they never return to the result for anxiety, which would seem to contradict their story. Instead they write, “the lack of consistent findings for 6-11 year olds confirm that this is a causal impact of the policy change.” “Confirm” seems a bit strong to me.

The authors also suggest:

For example, higher exposure to childcare could lead to increased reports of bad outcomes with no real underlying deterioration in child behaviour, if childcare providers identify negative behaviours not noticed (or previously acknowledged) by parents.

This seems like a reasonable guess to me! But the authors immediately dismiss this idea:

While we can’t rule out these alternatives, they seem unlikely given the consistency of our findings both across a broad spectrum of indices, and across the categories that make up each index (as shown in Appendix C). In particular, these alternatives would not suggest such strong findings for health-based measures, or for the more objective evaluations that underlie the motor-social skills index (such as counting to ten, or speaking a sentence of three words or more).

Health, sure: as noted above, I can well believe that these kids are catching colds from each other.

But what about that motor-skills index? Here are their results from the appendix:

[Screenshot: appendix table of motor-social skills results]

I’m not quite sure whether + or – is desirable here, but I do notice that the coefficients for “can count out loud to 10” and “spoken a sentence of 3 words or more” (the two examples cited in the paragraph above) go in opposite directions. That’s fine—the data are the data—but it doesn’t quite fit their story of consistency.

More generally, the data are addressed in a scattershot manner. For example:

We have estimated our models separately for those with and without siblings, finding no consistent evidence of a stronger effect on one group or another. While not ruling out the socialization story, this finding is not consistent with it.

This appears to be the classic error of interpretation of a non-rejection of a null hypothesis.

And here’s their table of key results:

[Screenshot: the paper’s table of key results]

As quantitative social scientists we need to think harder about how to summarize complicated data with multiple outcomes and many different comparisons.

I see the current standard ways to summarize this sort of data are:

(a) Focus on a particular outcome and a particular comparison (choosing these ideally, though not usually, using preregistration), present that as the main finding and then tag all else as speculation.

Or, (b) Construct a story that seems consistent with the general pattern in the data, and then extract statistically significant or nonsignificant comparisons to support your case.

Plan (b) is what was done here, and I think it has problems: lots of stories can fit the data, and there’s a real push toward sweeping any anomalies aside.

For example, how do you think about that coefficient of 0.308 with standard error 0.080 for anxiety among the 6-11-year-olds? You can say it’s just bad luck with the data, or that the standard error calculation is only approximate and the real standard error should be higher, or that it’s some real effect caused by what was happening in Quebec in these years—but the trouble is that any of these explanations could be used just as well to explain the 0.234 with standard error 0.068 for 2-4-year-olds, which directly maps to one of their main findings.

Once you start explaining away anomalies, there’s just a huge selection effect in which data patterns you choose to take at face value and which you try to dismiss.

So maybe approach (a) is better—just pick one major outcome and go with it? But then you’re throwing away lots of data; that can’t be right.

I am unconvinced by the claims of Baker et al., but it’s not like I’m saying their paper is terrible. They have an identification strategy, and clean data, and some reasonable hypotheses. I just think their statistical analysis approach is not working. One trouble is that statistics textbooks tend to focus on stand-alone analyses—getting the p-value right, or getting the posterior distribution, or whatever, and not on how these conclusions fit into the big picture. And of course there’s lots of talk about exploratory data analysis, and that’s great, but EDA is typically not plugged into issues of modeling, data collection, and inference.

What to do?

OK, then. Let’s forget about the strengths and the weaknesses of the Baker et al. paper and instead ask, how should one evaluate a program like Quebec’s nearly-free preschool? I’m not sure. I’d start from the perspective of trying to learn what we can from what might well be ambiguous evidence, rather than trying to make a case in one direction or another. And lots of graphs, which would allow us to see more in one place, that’s much better than tables and asterisks. But, exactly what to do, I’m not sure. I don’t know whether the policy analysis literature features any good examples of this sort of exploration. I’d like to see something, for this particular example and more generally as a template for program evaluation.

3. Nearly 20 years on

So here’s the story. I heard about this work in 2016, from a press release issued in 2006; the article was published in a top economics journal in 2008, it appeared in preprint form in 2005, and it was based on data collected in the late 1990s. And here we are discussing it again in 2023.

It’s kind of beating a dead horse to discuss a 20-year-old piece of research, but you know what they say about dead horses. Also, according to Google Scholar, the article has 1616 citations, including 120 citations in 2023 alone, so, yeah, still worth discussing.

That said, not all the references refer to the substance of the paper. For example, the very first paper on Google Scholar’s list of citers is a review article, Explaining the Decline in the US Employment-to-Population Ratio, and when I searched to see what they said about this Canada paper (Baker, Gruber, and Milligan 2008), here’s what was there:

Additional evidence on the effects of publicly provided childcare comes from the province of Quebec in Canada, where a comprehensive reform adopted in 1997 called for regulated childcare spaces to be provided to all children from birth to age five at a price of $5 per day. Studies of that reform conclude that it had significant and long-lasting effects on mothers’ labor force participation (Baker, Gruber, and Milligan 2008; Lefebvre and Merrigan 2008; Haeck, Lefebvre, and Merrigan 2015). An important feature of the Quebec reform was its universal nature; once fully implemented, it made very low-cost childcare available for all children in the province. Nollenberger and Rodriguez-Planas (2015) find similarly positive effects on mothers’ employment associated with the introduction of universal preschool for three-year-olds in Spain.

They didn’t mention the bit about “the evidence suggests that children are worse off” at all! Indeed, they’re just kinda lumping this in with positive studies on “the effects of publicly provided childcare.” Yes, it’s true that this new article specifically refers to “similarly positive effects on mothers’ employment,” and that earlier paper, while negative about the effect of universal child care on kids, did say, “Maternal labor supply increases significantly.” Still, when it comes to sentiment analysis, that 2008 paper just got thrown into the positivity blender.

I don’t know how to think about this.

On one hand, I feel bad for Baker et al.: they did this big research project, they achieved the academic dream of publishing it in a top journal, it’s received over 1616 citations and remains relevant today—but, when it got cited, its negative message was completely lost! I guess they should’ve given their paper a more direct title. Instead of “Universal Child Care, Maternal Labor Supply, and Family Well‐Being,” they should’ve called it something like: “Universal Child Care: Good for Mothers’ Employment, Bad for Kids.”

On the other hand, for the reasons discussed above, I don’t actually believe their strong claims about the child care being bad for kids, so I’m kinda relieved that, even though the paper is being cited, some of its message has been lost. You win some, you lose some.

Of course it’s preregistered. Just give me a sec

This is Jessica. I was going to post something on Bak-Coleman and Devezer’s response to the Protzko et al. paper on the replicability of research that uses rigor-enhancing practices like large samples, preregistration, confirmatory tests, and methodological transparency, but Andrew beat me to it. But since his post didn’t get into one of the surprising aspects of their analysis (beyond the paper making a causal claim without a study design capable of assessing causality), I’ll blog on it anyway.

Bak-Coleman and Devezer describe three ways in which the measure of replicability that Protzko et al. use to argue that the 16 effects they study are more replicable than effects in prior studies deviates from prior definitions of replicability:

  1. Protzko et al. define replicability as the chance that any replication achieves significance in the hypothesized direction as opposed to whether the results of the confirmation study and the replication were consistent 
  2. They include self-replications in calculating the rate
  3. They include repeated replications of the same effect and replications across different effects in calculating the rate

Could these deviations in how replicability is defined have been decided post-hoc, so that the authors could present positive evidence for their hypothesis that rigor-enhancing practices work? If they preregistered their definition of replicability, we would not be so concerned about this possibility.  Luckily, the authors report that “All confirmatory tests, replications and analyses were preregistered both in the individual studies (Supplementary Information section 3 and Supplementary Table 2) and for this meta-project (https://osf.io/6t9vm).”

But wait – according to Bak-Coleman and Devezer:

the analysis on which the titular claim depends was not preregistered. There is no mention of examining the relationship between replicability and rigor-improving methods, nor even how replicability would be operationalized despite extensive descriptions of the calculations of other quantities. With nothing indicating this comparison or metric it rests on were planned a priori, it is hard to distinguish the core claim in this paper from selective reporting and hypothesizing after the results are known. 

Uh-oh, that’s not good. At this point, some OSF sleuthing was needed. I poked around the link above and the associated project containing the analysis code. There are a couple of analysis plans: Proposed Overarching Analyses for Decline Effect final.docx, from 2018, and Decline Effect Exploratory analyses and secondary data projects P4.docx, from 2019. However, these do not appear to describe the primary analysis of replicability in the paper (the first describes an analysis that ends up in the Appendix, and the second a bunch of exploratory analyses that don’t appear in the paper). About a year later, the analysis notebooks with the results they present in the main body of the paper were added.

According to Bak-Coleman on X/Twitter: 

We emailed the authors a week ago. They’ve been responsive but as of now, they can’t say one way or another if the analyses correspond to a preregistration. They think they may be in some documentation.

In the best case scenario where the missing preregistration is soon found, this example suggests that there are still many readers and reviewers for whom some signal of rigor suffices even when the evidence of it is lacking. In this case, maybe the reputation of authors like Nosek reduced the perceived need on the part of the reviewers to track down the actual preregistration. But of course, even those who invented rigor-enhancing practices can still make mistakes!

In the alternative scenario where the preregistration is not found soon, what is the correct course of action? Surely at least a correction is in order? Otherwise we might all feel compelled to try our luck at signaling preregistration without having to inconvenience ourselves by actually doing it.

More optimistically, perhaps there are exciting new research directions that could come out of this. Like, wearable preregistration, since we know from centuries of research and practice that it’s harder to lose something when it’s sewn to your person. Or, we could submit our preregistrations to OpenAI, I mean Microsoft, who could make a ChatGPT-enabled Preregistration Buddy that is not only trained on your preregistration but also knows how to please a human judge who wants to ask questions about what it said.

Hey! Here are some amazing articles by George Box from around 1990. Also there’s some mysterious controversy regarding his center at the University of Wisconsin.

The webpage is maintained by John Hunter, son of Box’s collaborator William Hunter, and I came across it because I was searching for background on the paper-helicopter example that we use in our classes to teach principles of experimental design and data analysis.

There’s a lot to say about the helicopter example and I’ll save that for another post.

Here I just want to talk about how much I enjoyed reading these thirty-year-old Box articles.

A Box Set from 1990

Many of the themes in those articles continue to resonate today. For example:

• The process of learning. Here’s Box from his 1995 article, “Total Quality: Its Origins and its Future”:

Scientific method accelerated that process in at least three ways:

1. By experience in the deduction of the logical consequences of the group of facts each of which was individually known but had not previously been brought together.

2. By the passive observation of systems already in operation and the analysis of data coming from such systems.

3. By experimentation – the deliberate staging of artificial experiences which often might ordinarily never occur.

A misconception is that discovery is a “one shot” affair. This idea dies hard. . . .

• Variation over time. Here’s Box from his 1989 article, “Must We Randomize Our Experiment?”:

We all live in a non-stationary world; a world in which external factors never stay still. Indeed the idea of stationarity – of a stable world in which, without our intervention, things stay put over time – is a purely conceptual one. The concept of stationarity is useful only as a background against which the real non-stationary world can be judged. For example, the manufacture of parts is an operation involving machines and people. But the parts of a machine are not fixed entities. They are wearing out, changing their dimensions, and losing their adjustment. The behavior of the people who run the machines is not fixed either. A single operator forgets things over time and alters what he does. When a number of operators are involved, the opportunities for change because of failures to communicate are further multiplied. Thus, if left to itself any process will drift away from its initial state. . . .

Stationarity, and hence the uniformity of everything depending on it, is an unnatural state that requires a great deal of effort to achieve. That is why good quality control takes so much effort and is so important. All of this is true, not only for manufacturing processes, but for any operation that we would like to be done consistently, such as the taking of blood pressures in a hospital or the performing of chemical analyses in a laboratory. Having found the best way to do it, we would like it to be done that way consistently, but experience shows that very careful planning, checking, recalibration and sometimes appropriate intervention, is needed to ensure that this happens.

Here’s an example, from Box’s 1992 article, “How to Get Lucky”:

For illustration Figure 1(a) shows a set of data designed to seek out the source of unacceptably large variability which, it was suspected, might be due to small differences in five, supposedly identical, heads on a machine. To test this idea, the engineer arranged that material from each of the five heads was sampled at roughly equal intervals of time in each of six successive eight-hour periods. . . . the same analysis strongly suggested that real differences in means occurred between the six eight-hour periods of time during which the experiment was conducted. . . .

• Workflow. Here’s Box from his 1999 article, “Statistics as a Catalyst to Learning by Scientific Method Part II-Discussion”:

Most of the principles of design originally developed for agricultural experimentation would be of great value in industry, but most industrial experimentation differed from agricultural experimentation in two major respects. These I will call immediacy and sequentiality.

What I mean by immediacy is that for most of our investigations the results were available, if not within hours, then certainly within days and in rare cases, even within minutes. This was true whether the investigation was conducted in a laboratory, a pilot plant or on the full scale. Furthermore, because the experimental runs were usually made in sequence, the information obtained from each run, or small group of runs, was known and could be acted upon quickly and used to plan the next set of runs. I concluded that the chief quarrel that our experimenters had with using “statistics” was that they thought it would mean giving up the enormous advantages offered by immediacy and sequentiality. Quite rightly, they were not prepared to make these sacrifices. The need was to find ways of using statistics to catalyze a process of investigation that was not static, but dynamic.

There’s lots more. It’s funny to read Box saying, all those years ago, things that I and others have been saying over and over again in various informal contexts decades later. It’s a problem with our statistical education (including my own textbooks) that these important ideas are buried.

More Box

A bunch of articles by Box, overlapping partly but not completely with the above collection, can be found at the site of the University of Wisconsin, where he worked for many years. Enjoy.

Some kinda feud is going on

John Hunter’s page also has this:

The Center for Quality and Productivity Improvement was created by George Box and Bill Hunter at the University of Wisconsin-Madison in 1985.

In the first few years reports were published by leading international experts including: W. Edwards Deming, Kaoru Ishikawa, Peter Scholtes, Brian Joiner, William Hunter and George Box. William Hunter died in 1986. Subsequently excellent reports continued to be published by George Box and others including: Gipsie Ranney, Soren Bisgaard, Ron Snee and Bill Hill.

These reports were all available on the Center’s web site. After George Box’s death the reports were removed. . . .

It is a sad situation that the Center abandoned the ideas of George Box and Bill Hunter. I take what has been done to the Center as a personal insult to their memory. . . .

When diagnosed with cancer my father dedicated his remaining time to creating this center with George to promote the ideas George and he had worked on throughout their lives: because it was that important to him to do what he could. They did great work and their work provided great benefits for long after Dad’s death with the leadership of Bill Hill and Soren Bisgaard but then it deteriorated. And when George died the last restraint was eliminated and the deterioration was complete.

Wow. I wonder what the story was. I asked someone I know who works at the University of Wisconsin and he had no idea. Box died in 2013 so it’s not so long ago; there must be some people who know what happened here.

Experimental reasoning in social science

As a statistician, I was trained to think of randomized experimentation as representing the gold standard of knowledge in the social sciences, and, despite having seen occasional arguments to the contrary, I still hold that view, expressed pithily by Box, Hunter, and Hunter that “To find out what happens when you change something, it is necessary to change it.”

At the same time, in my capacity as a social scientist, I’ve published many applied research papers, almost none of which have used experimental data.

In the present article, I’ll address the following questions:

1. Why do I agree with the consensus characterization of randomized experimentation as a gold standard?

2. Given point 1 above, why does almost all my research use observational data?

. . .

For the rest, you can read the article. The piece in question was written in 2010, ultimately published in a book in 2014, and is still relevant today, I think! For more on this perspective you can see chapters 18-21, the causal inference chapters of Regression and Other Stories.

Also relevant is the idea that mathematical and statistical reasoning are themselves experimental, as discussed in this article on Lakatos and in this talk, “When You do Applied Statistics, You’re Acting Like a Scientist. Why Does this matter?”

Discovering what mattered: Answering reverse causal questions using change-point models

Felix Pretis writes:

I came across your 2013 paper with Guido Imbens, “Why ask why? Forward causal inference and reverse causal questions,” and found it to be extremely useful for a closely-related project my co-author Moritz Schwarz and I have been working on.

We introduce a formal approach to answer “reverse causal questions” by expanding on the idea mentioned in your paper that reverse causal questions involve “searching for new variables.” We place the concept of reverse causal questions into the domain of variable and model selection. Specifically, we focus on detecting and estimating treatment effects when both treatment assignment and timing are unknown. The setting of unknown treatment reflects the problem often faced by policy makers: rather than trying to understand whether a particular intervention caused an outcome to change, they might be concerned with the broader question of what affected the outcome in general, but they might be unsure what treatment interventions took place. For example, rather than asking whether carbon pricing reduced CO2 emissions, a policy maker might be interested in what reduces CO2 emissions in general.

We show that such unknown treatment can be detected as structural breaks in panels by using machine learning methods to remove all but relevant treatment interaction terms that capture heterogeneous treatment effects. We demonstrate the feasibility of this approach by detecting the impact of ETA terrorism on Spanish regional GDP per capita without prior knowledge of its occurrence.

Pretis and Schwarz describe their general idea in a paper, “Discovering what mattered: Answering reverse causal questions by detecting unknown treatment assignment and timing as breaks in panel models,” and they published an application of the approach in a paper, “Attributing agnostically detected large reductions in road CO2 emissions to policy mixes.”
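Their implementation builds on indicator saturation and model selection in panel models; here’s only a minimal sketch of the general flavor, on a single made-up series, where the “unknown treatment” is a level shift at an unknown date and a sparsity penalty over step dummies at every candidate date is used to find it:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
T = 120
true_break = 80                        # unknown to the analyst
y = 2.0 + rng.normal(0, 0.3, T)
y[true_break:] -= 1.0                  # the "treatment" shifts the level down

# One step-shift dummy per candidate break date.
steps = np.zeros((T, T - 2))
for t in range(1, T - 1):
    steps[t:, t - 1] = 1.0

# The L1 penalty keeps all but a few step dummies out of the model.
fit = Lasso(alpha=0.05, max_iter=10000).fit(steps, y)
detected = np.flatnonzero(np.abs(fit.coef_) > 0.1) + 1
print("detected break dates:", detected)   # should pick up a shift at or near t = 80

Their panel setting, with interaction terms to capture heterogeneous effects, is of course much richer than this, but the basic move, turning “what happened, and when?” into a model-selection problem, is the same.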

It’s so cool to see this sort of work being done, transferring general concepts about causal inference to methods that can be used in real applications.

Springboards to overconfidence: How can we avoid . . .? (following up on our discussion of synthetic controls analysis)

Following up on our recent discussion of synthetic control analysis for causal inference, Alberto Abadie points to this article from 2021, “Using Synthetic Controls: Feasibility, Data Requirements, and Methodological Aspects.”

Abadie’s paper is very helpful in that it lays out the key assumptions and decision points, which can help us have a better understanding of what went so wrong in the paper on Philadelphia crime rates that we discussed in my earlier post.

I think it’s a general concern in methods papers (mine included!) that we tend to focus more on examples where the method works well than on examples where it doesn’t. Abadie’s paper has an advantage over mine in that he gives conditions under which a method will work, and it’s not his fault that researchers then use the methods and get bad answers.

Regarding the specific methods issue, of course there are limits to what can be learned from N=1 treated units, whether analyzed using synthetic control or any other approach. It seems that researchers sometimes lose track of that point in their desire to make strong statements. On a very technical level, I suspect that, if researchers are using a weighted average as a comparison, they’d do better using some regularization rather than just averaging over a very small number of other cases. But I don’t think that would help much in that particular application that we were discussing on the blog.

The deeper problem

The question is, when scholars such as Abadie write such clear descriptions of a method, including all its assumptions, how is it that applied researchers such as the authors of that Philadelphia article make such a mess of things? The problem is not unique to synthetic control analysis; it also arises with other “identification strategies” such as regression discontinuity, instrumental variables, linear regression, and plain old randomized experimentation. In all these cases, researchers often seem to end up using the identification strategy not as a tool for learning from data but rather as a sort of springboard to overconfidence. Beyond causal inference, there are all the well-known misapplications of Bayesian inference and classical p-values. No method is safe.

So, again, nothing special about synthetic control analysis. But what did happen in the example that got this discussion started? To quote from the original article:

The research question here is whether the application of a de-prosecution policy has an effect on the number of homicides for large cities in the United States. Philadelphia presents a natural experiment to examine this question. During 2010–2014, the Philadelphia District Attorney’s Office maintained a consistent and robust number of prosecutions and sentencings. During 2015–2019, the office engaged in a systematic policy of de-prosecution for both felony and misdemeanor cases. . . . Philadelphia experienced a concurrent and historically large increase in homicides.

After looking at the time series, here’s my quick summary: Philadelphia’s homicide rate went up after 2014, during the same period that the city decreased prosecutions, and this was part of a national trend of increased homicides—but there’s no easy way, given the directly available information, to compare to other cities with and without that policy.

I’ll refer you to my earlier post and its comment thread for more on the details.

At this point, the authors of the original article used a synthetic controls analysis, following the general approach described in the Abadie paper. The comparisons they make are to that weighted average of Detroit, New Orleans, and New York. The trouble is . . . that’s just 3 cities, and homicide rates can vary a lot from city to city. There’s no good reason to think that an average of three cities that give you numbers comparable to Philadelphia’s for the homicide rates or counts in the five previous years will give you a reasonable counterfactual for trends in the next five years. Beyond this, some outside researchers pointed out many forking paths in the published analysis. Forking paths are not in themselves a problem (my own applied work is full of un-preregistered data coding and analysis decisions); the relevance here is that they help explain how it’s possible for researchers to get apparently “statistically significant” results from noisy data.

So what went wrong? Abadie’s paper lays out the mathematical problem: if you want to compare Philadelphia to some weighted average of the other 96 cities, and if you want these weights to be positive, sum to 1, and be estimated using an otherwise unregularized procedure, then there are certain statistical properties associated with the resulting procedure, which, in this case, given various other decisions, leads to choosing a particular average of Detroit, New Orleans, and New York. There’s nothing wrong with doing this, but, ultimately, all you have is a comparison of 1 city to 3 cities, and it’s completely legit from an applied perspective to look at these cities and recognize how different they all are.
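Here’s a minimal sketch of that weight-fitting step on made-up data (1 treated unit, 10 donors, 5 pre-treatment years; this is not the published Philadelphia analysis or Abadie’s full procedure, which also uses covariates). Constraining the weights to be nonnegative and sum to 1 and minimizing pre-period fit tends to concentrate weight on a handful of donors; adding a small ridge penalty, the kind of regularization I mentioned above, spreads the weight out:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n_donors, n_pre = 10, 5
X_donors = rng.normal(10, 2, size=(n_pre, n_donors))  # donors' pre-period outcomes
x_treated = rng.normal(10, 2, size=n_pre)             # treated unit's pre-period outcomes

def fit_weights(ridge=0.0):
    # Nonnegative weights summing to 1 that best match the treated unit's pre-period outcomes.
    def loss(w):
        return np.sum((x_treated - X_donors @ w) ** 2) + ridge * np.sum(w ** 2)
    res = minimize(loss, np.full(n_donors, 1.0 / n_donors), method="SLSQP",
                   bounds=[(0.0, 1.0)] * n_donors,
                   constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}])
    return res.x

print(np.round(fit_weights(ridge=0.0), 2))   # typically a few donors get nearly all the weight
print(np.round(fit_weights(ridge=5.0), 2))   # shrinks toward spreading weight across donors

Either way, the fitted comparison is only as good as the assumption that a handful of donor cities that matched Philadelphia for five years would have kept matching it for the next five.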

It’s not the fault of the synthetic control analysis if you have N=1 in the treatment group. It’s just the way things go. The error is to use that analysis to make strong claims, and the further error is to think that the use of this particular method—or any particular method—should insulate the analysis from concerns about reasonableness. If you want to compare one city to 96 others, then your analysis will rely on assumptions about comparability of the different cities, and not just on one particular summary such as the homicide counts during a five-year period.

You can say that this general concern arises with linear regression as well—you’re only adjusting for whatever pre-treatment variables are included in the model. For example, when we estimated the incumbency advantage in congressional elections by comparing elections with incumbents running for reelection to elections in open seats, adjusting for previous vote share and party control, it would be a fair criticism to say that maybe the treatment and control cases differed in other important ways not included in the analysis. And we looked at that! I’m not saying our analysis was perfect; indeed, a decade and a half later we reanalyzed the data with a measurement-error model and got what we think were improved results. It was a big help that we had replication: many years, and many open-seat and incumbent elections in each year. This Philadelphia analysis is different because it’s N=1. If we tried to do linear regression with N=1, we’d have all sorts of problems. Unfortunately, the synthetic control analysis did not resolve the N=1 problem—it’s not supposed to!—but it did seem to lead the authors into some strong claims that did not make a lot of sense.

P.S. I sent the above to Abadie, who added:

I would like to share a couple of thoughts about N=1 and whether it is good or bad to have a small number of units in the comparison group.

Synthetic controls were originally proposed to address the N=1 (or low N) setting in cases with aggregate and relatively noiseless data and strong co-movement across units. I agree with you that they do not mechanically solve the N=1 problem in general (and that nothing does!). They have to be applied with care and there will be settings where they do not produce credible estimates (e.g., noisy series, short pre-intervention windows, poor pre-intervention fit, poor prediction in hold-out pre-intervention windows, etc). There are checks (e.g., predictive power in hold-out pre-intervention windows) that help assess the credibility of synthetic control estimates in applied settings.

Whether a few controls or many controls are better depends on the context of the investigation and on what one is trying to attain. Precision may call for using many comparisons. But there is a trade-off. The more units we use as comparisons, the less similar those may be relative to the treated unit. And the use of a small number of units allows us to evaluate / correct for potential biases created by idiosyncratic shocks and / or interference effects on the comparison units. If the aggregate series are “noiseless enough” like in the synthetic control setting, one might care more about reducing bias than about attaining additional precision.
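To make that last check concrete, here’s a minimal sketch on made-up data (again, not the Philadelphia analysis): fit the weights on the early pre-intervention years, then see whether the synthetic control actually tracks the treated unit in the held-out pre-intervention years.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
n_donors, n_pre, n_holdout = 10, 8, 2
common = rng.normal(0, 1, n_pre)                                  # shared year-to-year movement
X = 10 + common[:, None] + rng.normal(0, 0.5, (n_pre, n_donors))  # donor outcomes
x = 10 + common + rng.normal(0, 0.5, n_pre)                       # treated unit's outcomes

fit_years = slice(0, n_pre - n_holdout)                           # fit weights on the first 6 years
def loss(w):
    return np.sum((x[fit_years] - X[fit_years] @ w) ** 2)
w = minimize(loss, np.full(n_donors, 1.0 / n_donors), method="SLSQP",
             bounds=[(0.0, 1.0)] * n_donors,
             constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}]).x

holdout = slice(n_pre - n_holdout, n_pre)                         # check tracking on the last 2 years
print("hold-out error, synthetic control:", np.round(np.abs(x[holdout] - X[holdout] @ w), 2))
print("hold-out error, simple donor mean:", np.round(np.abs(x[holdout] - X[holdout].mean(axis=1)), 2))

If the hold-out errors are no better than what a crude donor average gives you, that’s a warning sign that the pre-period fit isn’t telling you much about the counterfactual.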