Their signal-to-noise ratio was low, so they decided to do a specification search, use a one-tailed test, and go with a p-value of 0.1.

Adam Zelizer writes:

I saw your post about the underpowered COVID survey experiment on the blog and wondered if you’ve seen this paper, “Counter-stereotypical Messaging and Partisan Cues: Moving the Needle on Vaccines in a Polarized U.S.” It is written by a strong team of economists and political scientists and finds large positive effects of Trump pro-vaccine messaging on vaccine uptake.

They find large positive effects of the messaging (administered through Youtube ads) on the number of vaccines administered at the county level—over 100 new vaccinations in treated counties—but only after changing their specification from the prespecified one in the PAP. The p-value from the main modified specification is only 0.097, from a one-tailed test, and the effect size from the modified specification is 10 times larger than what they get from the pre-specified model. The prespecified model finds that showing the Trump advertisement increased the number of vaccines administered in the average treated county by 10; the specification in the paper, and reported in the abstract, estimates 103 more vaccines. So moving from the specification in the PAP to the one in the paper doesn’t just improve precision, but it dramatically increases the estimated treatment effect. A good example of suppression effects.

They explain their logic for using the modified specification, but it smells like the garden of forking paths.

Here’s a snippet from the article:

I don’t have much to say about the forking paths except to give my usual advice to fit all reasonable specifications and use a hierarchical model, or at the very least do a multiverse analysis. No reason to think that the effect of this treatment should be zero, and if you really care about effect size you want to avoid obvious sources of bias such as model selection.
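
To make the “fit all reasonable specifications” advice concrete, here’s a minimal multiverse-style sketch in Python. The variable names (y, log_y, treated, x1, x2) are hypothetical placeholders, and a real analysis would go further (ideally partial pooling of the specifications with a hierarchical model), but the basic move is just: enumerate the defensible modeling choices and look at the whole distribution of estimates, rather than reporting the one specification with the smallest p-value.

```python
# Minimal multiverse sketch: fit every combination of reasonable modeling
# choices and inspect the full distribution of treatment-effect estimates.
# Column names here are hypothetical placeholders.
import itertools
import pandas as pd
import statsmodels.formula.api as smf

def multiverse(df: pd.DataFrame) -> pd.DataFrame:
    controls = ["x1", "x2"]        # candidate control variables
    outcomes = ["y", "log_y"]      # candidate outcome codings
    rows = []
    for outcome in outcomes:
        for k in range(len(controls) + 1):
            for subset in itertools.combinations(controls, k):
                rhs = " + ".join(("treated",) + subset)
                fit = smf.ols(f"{outcome} ~ {rhs}", data=df).fit()
                rows.append({
                    "outcome": outcome,
                    "controls": subset,
                    "estimate": fit.params["treated"],
                    "std_err": fit.bse["treated"],
                })
    return pd.DataFrame(rows)

# results = multiverse(df)
# results["estimate"].describe()   # how much do the estimates move around?
```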

The above bit about one-tailed tests reflects a common misunderstanding in social science. As I’ll keep saying until my lips bleed, effects are never zero. They’re large in some settings, small in others, sometimes positive, sometimes negative. From the perspective of the researchers, the idea of the hypothesis test is to give convincing evidence that the treatment truly has a positive average effect. That’s fine, and it’s addressed directly through estimation: the uncertainty interval gives you a sense of what the data can tell you here.

When they say they’re doing a one-tailed test and they’re cool with a p-value of 0.1 (that would be 0.2 when following the standard approach) because they have “low signal-to-noise ratios” . . . that’s just wack. Low signal-to-noise ratio implies high uncertainty in your conclusions. High uncertainty is fine! You can still recommend this policy be done in the midst of this uncertainty. After all, policymakers have to do something. To me, this one-sided testing and p-value thresholding thing just seems to be missing the point, in that it’s trying to squeeze out an expression of near-certainty from data that don’t admit such an interpretation.

P.S. I do not write this sort of post out of any sort of animosity toward the authors or toward their topic of research. I write about these methods issues because I care. Policy is important. I don’t think it is good for policy for researchers to use statistical methods that lead to overconfidence and inappropriate impressions of certainty or near-certainty. The goal of a statistical analysis should not be to attain statistical significance or to otherwise reach some sort of success point. It should be to learn what we can from our data and model, and to also get a sense of what we don’t know.

Putting a price on vaccine hesitancy (Bayesian analysis of a conjoint experiment)

Tom Vladeck writes:

I thought you may be interested in some internal research my company did using a conjoint experiment, with analysis using Stan! The upshot is that we found that vaccine hesitant people would require a large payment to take the vaccine, and that there was a substantial difference between the prices required for J&J and Moderna & Pfizer (evidence that the pause was very damaging). You can see the model code here.

My reply: Cool! I recommend you remove the blank lines from your Stan code as that will make your program easier to read.

Vladeck responded:

I prefer a lot of vertical white space. But good to know that I’m likely in the minority there.

For me, it’s all about the real estate. White space can help code be more readable but it should be used sparingly. What I’d really like is a code editor that does half white spaces.

My NYU econ talk will be Thurs 18 Apr 12:30pm (NOT Thurs 7 Mar)

Hi all. The other day I announced a talk I’ll be giving at the NYU economics seminar. It will be Thurs 18 Apr 12:30pm at 19 West 4th St., room 517.

In my earlier post, I’d given the wrong day for the talk. I’d written that it was this Thurs, 7 Mar. That was wrong! Completely my fault here; I misread my own calendar.

So I hope nobody shows up to that room tomorrow! Thank you for your forbearance.

I hope to see youall on Thurs 18 Apr. Again, here’s the title and abstract:

How large is that treatment effect, really?

“Unbiased estimates” aren’t really unbiased, for a bunch of reasons, including aggregation, selection, extrapolation, and variation over time. Econometrics typically focuses on causal identification, with the goal of estimating “the” effect. But we typically care about individual effects (not “Does the treatment work?” but “Where and when does it work?” and “Where and when does it hurt?”). Estimating individual effects is relevant not only for individuals but also for generalizing to the population. For example, how do you generalize from an A/B test performed on a sample right now to possible effects on a different population in the future? Thinking about variation and generalization can change how we design and analyze experiments and observational studies. We demonstrate with examples in social science and public health.

How large is that treatment effect, really? (My talk at the NYU economics seminar, Thurs 18 Apr, NOT Thurs 7 Mar)

Thurs 18 Apr 2024 (NOT Thurs 7 Mar), 12:30pm at 19 West 4th St., room 517:

How large is that treatment effect, really?

“Unbiased estimates” aren’t really unbiased, for a bunch of reasons, including aggregation, selection, extrapolation, and variation over time. Econometrics typically focuses on causal identification, with the goal of estimating “the” effect. But we typically care about individual effects (not “Does the treatment work?” but “Where and when does it work?” and “Where and when does it hurt?”). Estimating individual effects is relevant not only for individuals but also for generalizing to the population. For example, how do you generalize from an A/B test performed on a sample right now to possible effects on a different population in the future? Thinking about variation and generalization can change how we design and analyze experiments and observational studies. We demonstrate with examples in social science and public health.

How to code and impute income in studies of opinion polls?

Nate Cohn asks:

What’s your preferred way to handle income in a regression when income categories are inconsistent across several combined survey datasets? Am I best off just handling this with multiple categorical variables? Can I safely create a continuous variable?

My reply:

I thought a lot about this issue when writing Red State Blue State. My preferred strategy is to use a variable that we could treat as continuous. For example when working with ANES data I was using income categories 1,2,3,4,5 which corresponded to income categories 1-16th percentile, 16-33rd, 34-66th, 67-95th, and 96-100th. If you have different surveys with different categories, you could use a somewhat consistent scaling, for example one survey you might code as 1,3,5,7 and another might be coded as 2,4,6,8. I expect that other people would disagree with this advice but this is the sort of thing that I was doing. I’m not so much worried about the scale being imperfect or nonlinear. But if you have a non-monotonic relation, you’ll have to be more careful.
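
Here’s a small sketch of that kind of recoding, with made-up category cutpoints (not the actual ANES or any other survey’s coding): map each survey’s categories to the midpoint of the income-percentile range they cover, so that the combined variable sits on a roughly common, roughly continuous scale.

```python
# Sketch: put income categories from different surveys on a roughly common
# scale by mapping each category to the midpoint of the percentile range it
# covers. The cutpoints below are illustrative, not real survey codings.
import pandas as pd

survey_a_bins = [(0, 16), (16, 33), (33, 66), (66, 95), (95, 100)]  # 5 categories
survey_b_bins = [(0, 25), (25, 50), (50, 75), (75, 100)]            # 4 categories

def midpoint_scale(bins):
    """Map category 1..K to the midpoint percentile of its range."""
    return {k + 1: (lo + hi) / 2 for k, (lo, hi) in enumerate(bins)}

def recode(df: pd.DataFrame, category_col: str, bins) -> pd.Series:
    return df[category_col].map(midpoint_scale(bins))

# df_a["income_scaled"] = recode(df_a, "income_cat", survey_a_bins)
# df_b["income_scaled"] = recode(df_b, "income_cat", survey_b_bins)
# The stacked surveys can then treat income as (roughly) continuous, keeping
# in mind the caveat above about possible non-monotonic relations.
```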

Cohn responds:

Two other thoughts for consideration:

— I am concerned about non-monotonicity. At least in this compilation of 2020 data, the Democrats do best among rich and poor, and sag in the middle. It seems even more extreme when we get into the highest/lowest income strata, a la ANES. I’m not sure this survives controls—it seems like there’s basically no income effect after controls—but I’m hesitant to squelch a possible non-monotonic effect that I haven’t ruled out.

—I’m also curious for your thoughts on a related case. Suppose that (a) the dataset includes surveys that sometimes asked about income and sometimes did not, (b) we’re interested in many demographic covariates besides income, and (c) we’d otherwise clearly specify the interaction between income and the other variables. The missing income data creates several challenges. What should we do?

I can imagine some hacky solutions to the NA data problem besides outright removing observations (say, set all NA income to 1 and interact our continuous income variable with whether we have actual income data), but if we interact other variables with the NA income data there are lots of cases (say, MRP where the population strata specify income for the full pop, not in proportion to survey coverage) where we’d risk losing much of the power gleaned from other surveys about the other demographic covariates. What should we do here?

My quick recommendation is to fit a model with two stages, first predicting income given your other covariates, then predicting your outcome of interest (issue attitude, vote preference, whatever) given income and the other covariates. You can fit the two models simultaneously in one Stan program. I guess then you will want some continuous coding for income (could be something like sqrt(income) with income topcoded at $300K) along with a possibly non-monotonic model at the second level.
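
The two-stage structure looks something like the following sketch (column names such as income, vote, age, educ, and has_income are hypothetical; a joint fit in one Stan program, as recommended above, would propagate the uncertainty in the imputed incomes into the second stage, which this quick two-step version does not).

```python
# Rough two-step sketch of the recommendation above: (1) predict income from
# the other covariates using the surveys that asked about it, (2) predict the
# outcome given income and the covariates. Column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

def two_stage(df: pd.DataFrame):
    # Stage 1: model income among respondents whose survey asked about it.
    observed = df[df["has_income"]]
    stage1 = smf.ols("income ~ age + educ", data=observed).fit()

    # Fill in predicted income where the survey did not ask.
    df = df.copy()
    missing = ~df["has_income"]
    df.loc[missing, "income"] = stage1.predict(df.loc[missing])

    # Stage 2: model the outcome of interest given income and covariates.
    stage2 = smf.logit("vote ~ income + age + educ", data=df).fit()
    return stage1, stage2

# A fully Bayesian version would put both regressions in one Stan model so
# that the imputation uncertainty flows into the second-stage estimates.
```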

Hey! A new (to me) text message scam! Involving a barfing dog!

Last year Columbia changed our phone system so now we can accept text messages. This can be convenient, and sometimes people reach me that way.

But then the other day this text came in:

And, the next day:

Someone’s dog has been vomiting, and this person is calling from two different numbers—home and work, perhaps? That’s too bad! I hope they reach the real Dr. Ella before the dog gets too sick.

Then this:

And now I started getting suspicious. How exactly does someone get my phone as a wrong number for a veterinarian? I’ve had this work number for over 25 years! It could be that someone typed in a phone number wrong. But . . . how likely is it that two unrelated people (the owner of a sick dog and the seller of veterinary products) would mistype someone’s number in the exact same way on the exact same day?

Also, “Dr. Ella”? I get that people give their doctors nicknames like that, but in a message to the office they would use the doctor’s last name, no?

Meanwhile, these came in:

Lisa, Ella, whatever. Still it seemed like some kinda mixup, and I had no thought that it might be a scam until I came across this post from Max Read, “What’s the deal with all those weird wrong-number texts?”, which answered all my questions.

Apparently the veterinarian, the yachts, and all the rest, are just a pretext to get you involved in a conversation where the scammers then befriend you before stealing as much of your money as they can. Kinda mean, huh? Can’t they do something more socially beneficial, like do some politically incorrect p-hacking or something involving soup bowls or paper shredders? Or just plagiarize a book about giraffes?

Hey, here’s some free money for you! Just lend your name to this university and they’ll pay you $1000 for every article you publish!

Remember that absolutely ridiculous claim that scientific citations are worth $100,000 each?

It appears that someone is taking this literally. Or, nearly so. Nick Wise has the story:

A couple of months ago a professor received the following email, which they forwarded to me.

Dear esteemed colleagues,

We are delighted to extend an invitation to apply for our prestigious remote research fellowships at the University of Religions and Denominations (URD) . . . These fellowships offer substantial financial support to researchers with papers currently in press, accepted or under review by Scopus-indexed journals. . . .

Fellowship Type: Remote Short-term Research Fellowship. . . .

Affiliation: Encouragement for researchers to acknowledge URD as their additional affiliation in published articles.

Remuneration: Project-based compensation for each research article.

Payment Range: Up to $1000 USD per article (based on SJR journal ranking). . . .

Why would the institution pay researchers to say that they are affiliated with them? It could be that funding for the university is related to the number of papers published in indexed journals. More articles associated with the university can also improve their placing in national or international university rankings, which could lead directly to more funding, or to more students wanting to attend and bringing in more money.

The University of Religions and Denominations is a private Iranian university . . . Until recently the institution had very few published papers associated with it . . . and their subject matter was all related to religion. . . . However, last year there was a substantial increase to 103 published papers, and so far this year there are already 35. This suggests that some academics have taken them up on the offer in the advert to include URD as an affiliation.

Surbhi Bhatia Khan is a lecturer in data science at the University of Salford in the UK since March 2023 and a top 2% scientist in the world according to Stanford University’s rankings. She published 29 research articles last year according to Dimensions, an impressive output, in which she was primarily affiliated to the University of Salford. In addition though, 5 of those submitted in the 2nd half of last year had an additional affiliation at the Department of Engineering and Environment at URD, which is not listed as one of the departments on the university website. Additionally, 19 of the 29 state that she’s affiliated to the Lebanese American University in Beirut, which she was not affiliated with before 2023. She is yet to mention her role at either of these additional affiliations on her LinkedIn profile.

Looking at the Lebanese American University, another private university, its publication numbers have shot up from 201 in 2015 to 503 in 2021 and 2,842 in 2023, according to Dimensions. So far in 2024 they have published 525, on track for over 6,000 publications for the year. By contrast, according to the university website, the faculty consisted of 547 full-time staff members in 2021 but had shrunk to 423 in 2023. It is hard to imagine how such growth in publication numbers could occur without a similar growth in the faculty, let alone with a reduction.

Wise writes:

How many other institutions are seeing incredible increases in publication numbers? Last year we saw gaming of the system on a grand scale by various Saudi Arabian universities, but how many offers like the one above are going around, whether by email or sent through Whatsapp groups or similar?

It’s bad news when universities in England, Iran, Saudi Arabia, and Lebanon start imitating the corrupt citation practices that we have previously associated with nearby Cornell University.

But I can see where Dr. Khan is coming from: if someone’s gonna send you free money, why not take it? Even if the “someone” is a University of Religions and Denominations, and none of your published research relates to religion, and you list an affiliation with an apparently nonexistent department.

The only thing that’s bugging me is that, according to an esteemed professor at Northeastern University, citations are worth $100,000 each—indeed, we are told that it is possible to calculate “exactly how much a single citation is worth.” In that case, Dr. Khan is getting ripped off by University of Religions and Denominations, who are offering a paltry “up to $1000”—and that’s per article, not per citation! I know about transaction costs etc. but maybe she could at least negotiate them up to $2000 per.

I can’t imagine this scam going on for long, but while it lasts you might as well get in on it. Why should professors at Salford University have all the fun?

Parting advice

Just one piece of advice for anyone who’s read this far down into the post: if you apply for the “Remote Short-term Research Fellowship” and you get it, and you send them the publication notice for your article that includes your affiliation with the university, and then they tell you that they’ll be happy to send you a check for $1000, you just have to wire them a $10 processing fee . . . don’t do it!!!

“Replicability & Generalisability”: Applying a discount factor to cost-effectiveness estimates.

This one’s important.

Matt Lerner points us to this report by Rosie Bettle, Replicability & Generalisability: A Guide to CEA discounts.

“CEA” is cost-effectiveness analysis, and by “discounts” they mean what we’ve called the Edlin factor. “Discount” is a better name than “factor” because it’s a number that should be between 0 and 1: it’s what you should multiply a point estimate by to adjust for inevitable upward biases in reported effect-size estimates, issues discussed here and here, for example.

It’s pleasant to see some of my ideas being used for a practical purpose. I would just add that type M and type S errors should be lower for Bayesian inferences than for raw inferences that have not been partially pooled toward a reasonable prior model.
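
For readers who haven’t seen the type M / type S calculation, here’s a quick simulation sketch of the retrodesign-style computation: type M is the expected exaggeration of statistically significant estimates, type S is the probability that a significant estimate has the wrong sign. The inputs, a plausible true effect size and the study’s standard error, are whatever you consider reasonable for the problem at hand.

```python
# Simulation sketch of type M (exaggeration) and type S (sign) error rates
# for a study with a given true effect size and standard error.
import numpy as np
from scipy import stats

def retrodesign(true_effect, se, alpha=0.05, n_sims=100_000, seed=1):
    rng = np.random.default_rng(seed)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    estimates = rng.normal(true_effect, se, n_sims)
    significant = np.abs(estimates) > z_crit * se

    power = significant.mean()
    type_s = (estimates[significant] * np.sign(true_effect) < 0).mean()
    type_m = np.abs(estimates[significant]).mean() / abs(true_effect)
    return {"power": power, "type_s": type_s, "type_m": type_m}

# Example with illustrative numbers: a small true effect measured noisily,
# retrodesign(true_effect=2, se=8), gives low power, a nontrivial chance of
# the wrong sign, and a large exaggeration ratio -- which is exactly why a
# discount on published point estimates is needed.
```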

Also, regarding empirical estimation of adjustment factors, I recommend looking at the work of Erik van Zwet et al; here are some links:
What’s a good default prior for regression coefficients? A default Edlin factor of 1/2?
How large is the underlying coefficient? An application of the Edlin factor to that claim that “Cash Aid to Poor Mothers Increases Brain Activity in Babies”
The Shrinkage Trilogy: How to be Bayesian when analyzing simple experiments
Erik van Zwet explains the Shrinkage Trilogy
The significance filter, the winner’s curse and the need to shrink
Bayesians moving from defense to offense: “I really think it’s kind of irresponsible now not to use the information from all those thousands of medical trials that came before. Is that very radical?”
Explaining that line, “Bayesians moving from defense to offense”

I’m excited about the application of these ideas to policy analysis.

The importance of measurement, and how you can draw ridiculous conclusions from your statistical analyses if you don’t think carefully about measurement . . . Leamer (1983) got it.

[Screenshot: graph from the paper discussed below]

Jacob Klerman writes:

I have noted your recent emphasis on the importance of measurement (e.g., “Here are some ways to make your study replicable…”). For reasons not relevant here, I was rereading Leamer (1983), Let’s Take the Con Out of Econometrics—now 40 years old. It’s a fun, if slightly dated, paper that you seem to be aware of.

Leamer also makes the measurement point (emphasis added):

When the sampling uncertainty S gets small compared to the misspecification uncertainty M, it is time to look for other forms of evidence, experiments or nonexperiments. Suppose I am interested in measuring the width of a coin, and I provide rulers to a room of volunteers. After each volunteer has reported a measurement, I compute the mean and standard deviation, and I conclude that the coin has width 1.325 millimeters with a standard error of .013. Since this amount of uncertainty is not to my liking, I propose to find three other rooms full of volunteers, thereby multiplying the sample size by four, and dividing the standard error in half. That is a silly way to get a more accurate measurement, because I have already reached the point where the sampling uncertainty S is very small compared with the misspecification uncertainty M. If I want to increase the true accuracy of my estimate, it is time for me to consider using a micrometer. So too in the case of diet and heart disease. Medical researchers had more or less exhausted the vein of nonexperimental evidence, and it became time to switch to the more expensive but richer vein of experimental evidence.

Interesting. Good to see examples where ideas we talk about today were already discussed in the classic literature. I indeed think measurement is important and is under-discussed in statistics. Economists are very familiar with the importance of measurement, both in theory (textbooks routinely discuss the big challenges in defining, let alone measuring, key microeconomic quantities such as “the money supply”) and in practice (data gathering can often be a big deal, involving archival research, data quality checking, etc., even if unfortunately this is not always done), but then once the data are in, data quality and issues of bias and variance of measurement often seem to be forgotten. Consider, for example, this notorious paper where nobody at any stage in the research, writing, reviewing, revising, or editing process seemed to be concerned about that region with a purported life expectancy of 91 (see the above graph)—and that doesn’t even get into the bizarre fitted regression curve. But, hey, p less than 0.05. Publishing and promoting such a result based on the p-value represents some sort of apogee of trusting implausible theory over realistic measurement.

Also, if you want a good story about why it’s a mistake to think that your uncertainty should just go like 1/sqrt(n), check out this story which is also included in our forthcoming book, Active Statistics.
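
Leamer’s point can be written in one line of algebra. Treating the sampling and misspecification components as roughly independent (an assumption, but a standard one for this kind of back-of-the-envelope reasoning):

```latex
\[
  \text{total error (RMSE)} \;\approx\; \sqrt{S^2 + M^2},
  \qquad S = \frac{\sigma}{\sqrt{n}},
  \qquad M = \text{misspecification bias, which does not shrink with } n .
\]
```

Once S is small relative to M, quadrupling the sample size halves S but leaves the total essentially unchanged; it’s the micrometer (better measurement), not more rooms of volunteers, that reduces M.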

“My view is that if I can show that a result was cooked and that doing it correctly does not yield the answer the authors claimed, then the result is discredited. . . . What I hear, instead, is the following . . .”

Economic historian Tim Guinnane writes:

I have a general question that I have not seen addressed on your blog. Often this question turns into a narrow question about retracting papers, but I think that short-circuits an important discussion.

Like many in economic history, I am increasingly worried that much research in recent years reflects p-hacking, misrepresentation of the history, useless data, and other issues. I realize that the technical/statistical issues differ from paper to paper.

What I see is something like the following. You can use this paper as a concrete example, but the problems are much more widespread. We document a series of bad research practices. The authors played games with controls to get the “right” answer for the variable of interest. (See Table 1 of the paper). In the text they misrepresent the definitions of variables used in regressions; we show that if you use the stated definition, their results disappear. They use the wrong degrees of freedom to compute error bounds (in this case, they had to program the bounds by hand, since Stata automatically uses the right df). There are other and to our minds more serious problems involved in selectively dropping data, claiming sources do not exist, etc.

Step back from any particular problem. How should the profession think about claims such as ours? My view is that if I can show that a result was cooked and that doing it correctly does not yield the answer the authors claimed, then the result is discredited. The journals may not want to retract such work, but there should be support for publishing articles that point out such problems.

What I hear, instead, is the following. A paper estimates beta as .05 with a given SE. Even if we show that this is cooked—that is, that beta is a lot smaller or the SE a lot larger if you do not throw in extraneous regressors, or play games with variable definitions—then ours is not really a result. It is instead, I am told, incumbent on the critic to start with beta=.05 as the null, and show that doing things correctly rejects that null in favor of something less than .05 (it is characteristic of most of this work that there really is no economic theory, so the null is always “X does not matter” which boils down to “this beta is zero.” And very few even tell us whether the correct test is one- or two-sided).

This pushback strikes me as weaponizing the idea of frequentist hypothesis testing. To my mind, if I can show that beta=.05 comes from a cooked regression, then we need to start over. That estimate can be ignored; it is just one of many incorrect estimates one can generate by doing things inappropriately. It actually gives the unscrupulous an incentive to concoct more outlandish betas which are then harder to reject. More generally, it puts a strange burden of proof on critics. I have discussed this issue with some folks in natural sciences who find the pushback extremely difficult to understand. They note what I think is the truth: it encourages bad research behavior by suppressing papers that demonstrate that bad behavior.

It might be opportune to have a general discussion of these sorts of issues on your website. The Gino case raises something much simpler, I think. I fear that it will in some ways lower the bar: so long as someone is not actively making up their data (which I realize has not been proven, in case this email gets subpoenaed!) then we do not need to worry about cooking results.

My reply: You raise several issues that we’ve discussed on occasion (for some links, see here):

1. The “Research Incumbency Rule”: Once an article is published in some approved venue, it is taken as truth. Criticisms which would absolutely derail a submission in pre-publication review can be brushed aside if they are presented after publication. This is what you call “the burden of proof on critics.”

2. Garden of forking paths.

3. Honesty and transparency are not enough. Work can be non-fraudulent but still be crap.

4. “Passive corruption” when people know there’s bad work but they don’t do anything about it.

5. A disturbingly casual attitude toward measurement; see here for an example: https://statmodeling.stat.columbia.edu/2023/10/05/no-this-paper-on-strip-clubs-and-sex-crimes-was-never-gonna-get-retracted-also-a-reminder-of-the-importance-of-data-quality-and-a-reflection-on-why-researchers-often-think-its-just-fine-to-publ/ Many economists and others seem to have been brainwashed into thinking that it’s ok to have bad measurement because attenuation bla bla . . . They’re wrong.

He responded: If you want an example of economists using stunningly bad data and making noises about attenuation, see here.

The paper in question has the straightforward title, “We Do Not Know the Population of Every Country in the World for the Past Two Thousand Years.”

A feedback loop can destroy correlation: This idea comes up in many places.

The people who go by “Slime Mold Time Mold” write:

Some people have noted that not only does correlation not imply causality, no correlation also doesn’t imply no causality. Two variables can be causally linked without having an observable correlation. Two examples of people noting this previously are Nick Rowe offering the example of Milton Friedman’s thermostat and Scott Cunningham’s Do Not Confuse Correlation with Causality chapter in Causal Inference: The Mixtape.

We realized that this should be true for any control system or negative feedback loop. As long as the control of a variable is sufficiently effective, that variable won’t be correlated with the variables causally prior to it. We wrote a short blog post exploring this idea if you want to take a closer look. It appears to us that in any sufficiently effective control system, causally linked variables won’t be correlated. This puts some limitations on using correlational techniques to study anything that involves control systems, like the economy, or the human body. The stronger version of this observation, that the only case where causally linked variables aren’t correlated is when they are linked together as part of a control system, may also be true.

Our question for you is, has anyone else made this observation? Is it recognized within statistics? (Maybe this is all implied by Peston’s 1972 “The Correlation between Targets and Instruments”? But that paper seems totally focused on economics and has only 14 citations. And the two examples we give above are both economists.) If not, is it worth trying to give this some kind of formal treatment or taking other steps to bring this to people’s attention, and if so, what would those steps look like?

My response: Yes, this has come up before. It’s a subtle point, as can be seen in some of the confused comments to this post. In that example, the person who brought up the feedback-destroys-correlation example was economist Rachael Meager, and it was a psychologist, a law professor, and some dude who describes himself as “a professor, writer and keynote speaker specializing in the quality of strategic thinking and the design of decision processes” who missed the point. So it’s interesting that you brought up an example of feedback from the economics literature.

Also, as I like to say, correlation does not even imply correlation.
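
To see the thermostat point in action, here’s a quick simulation with invented numbers: the furnace output causally drives the indoor temperature, but because the controller offsets the outside temperature almost perfectly, furnace output and indoor temperature end up essentially uncorrelated.

```python
# Feedback destroys correlation: a near-perfect controller makes a cause
# (furnace output) uncorrelated with its effect (indoor temperature).
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
target = 20.0

outside = rng.normal(5, 10, n)                        # outside temperature
furnace = target - outside                            # controller offsets it
indoor = outside + furnace + rng.normal(0, 0.1, n)    # causally driven by both

print(np.corrcoef(furnace, indoor)[0, 1])   # ~0, despite direct causation
print(np.corrcoef(outside, indoor)[0, 1])   # ~0 as well
print(np.corrcoef(furnace, outside)[0, 1])  # ~-1: the controller's signature
```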

The point you are making about feedback is related to the idea that, at equilibrium in an idealized setting, price elasticity of demand should be -1, because if it’s higher or lower than that, it would make sense to alter the price accordingly and slide up or down that curve to maximize total $.
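
Here’s that elasticity argument written out, for the idealized case of a seller maximizing revenue and ignoring costs:

```latex
\[
  R(p) = p\,q(p), \qquad
  \frac{dR}{dp} = q(p) + p\,q'(p) = q(p)\,(1 + \varepsilon),
  \qquad \varepsilon = \frac{p\,q'(p)}{q(p)} .
\]
```

Setting dR/dp = 0 at the revenue-maximizing price gives an elasticity of exactly -1: if demand were less elastic you’d raise the price, if more elastic you’d lower it, sliding along the curve as described above.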

I’m not up on all this literature; it’s the kind of thing that people were writing about a lot back in the 1950s related to cybernetics. It’s also related to the idea that clinical trials exist on a phase transition where the new treatment exists but has not yet been determined to be better or worse than the old. This is sometimes referred to as “equipoise,” which I consider to be a very sloppy concept.

The other thing is that everybody knows how correlations can be changed by selection (Simpson’s paradox, the example of high school grades and SAT scores among students who attend a moderately selective institution, those holes in the airplane wings, etc etc.). Knowing about one mechanism for correlations to be distorted can perhaps make people less attuned to other mechanisms such as the feedback thing.

So, yeah, a lot going on here.

God is in every leaf of every tree—comic book movies edition.

Mark Evanier writes:

Martin Scorsese has directed some of the best movies ever made and most of them convey some powerful message with skill and depth. So it’s odd that when he complains about “comic book movies” and says they’re a danger to the whole concept of cinema, I have no idea what the f-word he’s saying. . . .

Mr. Scorsese is acting like “comic book movies” are some new thing. Just to take a some-time-ago decade at random, the highest grossing movie of 1980 was Star Wars: Episode V — The Empire Strikes Back. The highest-grossing movie of 1981 was Superman II. The highest of 1982 was E.T. the Extra-Terrestrial and the highest-grossing movies of the following years were Star Wars: Episode VI — Return of the Jedi, Ghostbusters, Back to the Future, Top Gun, Beverly Hills Cop II, Who Framed Roger Rabbit and Batman.

I dunno about you but I’d call most of those “comic book movies.” And now here we have Scorsese saying of the current flock, “The danger there is what it’s doing to our culture…because there are going to be generations now that think movies are only those — that’s what movies are.” . . .

This seems like a statistical problem, and I imagine some people have studied this more carefully. Evanier seems to be arguing that comic book movies are no bigger of a thing now than they were forty years ago. There must be some systematic analysis of movie genres over time that could address this question.

What to trust in the newspaper? Example of “The Simple Nudge That Raised Median Donations by 80%”

Greg Mayer points to this news article, “The Simple Nudge That Raised Median Donations by 80%,” which states:

A start-up used the Hebrew word “chai” and its numerical match, 18, to bump up giving amounts. . . . It’s a common donation amount among Jews — $18, $180, $1,800 or even $36 and other multiples.

So Daffy lowered its minimum gift to $18 and then went further, prompting any donor giving to any Jewish charity to bump gifts up by some related amount. Within a year, median gifts had risen to $180 from $100. . . .

I see several warning signs here:

1. “Within a year, median gifts had risen to $180 from $100.” This is a before/after change, not a direct comparison of outcomes.

2. No report, just a quoted number which could easily have been made up. Yes, the numbers in a report can be fabricated too, but that takes more work and is more risk. Making up numbers when talking with a reporter, that’s easy.

3. The people who report the number are motivated to claim success; the reporter is motivated to report a success. The article is filled with promotion for this company. It’s a short article that mentions “Daffy” 6 times, for example this bit which reads like a straight-up ad:

If you have children, grandchildren, nieces or nephews, there’s another possibility. Daffy has a family plan that allows children to prompt their adult relatives to support a cause the children choose. Why not put the app on their iPhones or iPads so they can make suggestions and let, for example, a 12-year-old make $12 donations to 12 nonprofits each year?

Why not, indeed? Even better, why not have them make their donations directly to Daffy and cut out the middleman?? Look, I’m not saying that the people behind Daffy are doing anything wrong; it’s just that this is public relations, not journalism.

4. Use of the word “nudge” in the headline is consistent with business-press hype. Recall that “nudge” is a subfield whose proponents are well connected in the media and routinely make exaggerated claims.

So, yeah, an observational comparison with no documentation, in an article that’s more like an advertisement, that’s kinda sus. Not that the claim is definitely wrong, there’s just no good reason for us to take it seriously.

Here’s a sad post for you to start the new year. The Onion (ok, an Onion-affiliate site) is plagiarizing. For reals.

How horrible. I remember when The Onion started. They were so funny and on point. And now . . . What’s the point of even having The Onion if it’s running plagiarized material? I mean, yeah, sure, everybody’s gotta bring home money to put food on the table. But, really, what’s the goddam point of it all?

Jonathan Bailey has the story:

Back in June, G/O Media, the company that owns A.V. Club, Gizmodo, Quartz and The Onion, announced that they would be experimenting with AI tools as a way to supplement the work of human reporters and editors.

However, just a week later, it was clear that the move wasn’t going smoothly. . . . several months later, it doesn’t appear that things have improved. If anything, they might have gotten worse.

The reason is highlighted in a report by Frank Landymore and Jon Christian at Futurism. They compared the output of A.V. Club’s AI “reporter” against the source material, namely IMDB. What they found were examples of verbatim and near-verbatim copying of that material, without any indication that the text was copied. . . .

The articles in question have a note that reads as follows: “This article is based on data from IMDb. Text was compiled by an AI engine that was then reviewed and edited by the editorial staff.”

However, as noted by the Futurism report, that text does not indicate that any text is copied. Only that “data” is used. The text is supposed to be “compiled” by the AI and then “reviewed and edited” by humans. . . .

In both A.V. Club lists, there is no additional text or framing beyond the movies and the descriptions, which are all based on IMDb descriptions and, as seen in this case, sometimes copied directly or nearly directly from them.

There’s not much doubt that this is plagiarism. Though A.V. Club acknowledges that the “data” came from IMDb, it doesn’t indicate that the language does. There are no quotation marks, no blockquotes, nothing to indicate that portions are copied verbatim or near-verbatim. . . .

Bailey continues:

None of this is a secret. All of this is well known, well-understood and backed up with both hard data and mountains of anecdotal evidence. . . . But we’ve seen this before. Benny Johnson, for example, is an irredeemably unethical reporter with a history of plagiarism, fabrication and other ethical issues that resulted in him being fired from multiple publications.

Yet, he’s never been left wanting for a job. Publications know that, because of his name, he will draw clicks and engagement. . . . From a business perspective, AI is not very different from Benny Johnson. Though the flaws and integrity issues are well known, the allure of a free reporter who can generate countless articles at the push of a button is simply too great to ignore.

Then comes the economic argument:

But in there lies the problem, if you want AI to function like an actual reporter, it has to be edited, fact checked and plagiarism checked just like a real human.

However, when one does those checks, the errors quickly become apparent and fixing them often takes more time and resources than just starting with a human author.

In short, using an AI in a way that helps a company earn/save money means accepting that the factual errors and plagiarism are just part of the deal. It means completely forgoing journalism ethics, just like hiring a reporter like Benny Johnson.

Right now, for a publication, there is no ethical use of AI that is not either unprofitable or extremely limited. These “experiments” in AI are not about testing what the bots can do, but about seeing how much they can still lower their ethical and quality standards and still find an audience.

Ouch.

Very sad to see an Onion-affiliated site doing this.

Here’s how Bailey concludes:

The arc of history has been pulling publications toward larger quantities of lower quality content for some time. AI is just the latest escalation in that trend, and one that publishers are unlikely to ignore.

Even if it destroys their credibility.

No kidding. What next, mathematics professors who copy stories unacknowledged, introduce errors, and then deny they ever did it? Award-winning statistics professors who copy stuff from wikipedia, introducing stupid-ass errors in the process? University presidents? OK, none of those cases were shocking, they’re just sad. But to see The Onion involved . . . that truly is a step further into the abyss.

Uh oh Barnard . . .

Paul Campos tells a story that starts just fine:

In fiscal year 2013, the [University of Florida law] school was pulling in about $36.6 million in tuition revenue (2022 dollars). Other revenue, almost all of which consisted of endowment income, pushed the school’s total self-generated revenue to around $39 million in constant dollars.

At the time, the law school had 62 full time faculty members, which meant it was generating around $645,000 per full-time faculty member. The school’s total payroll, including benefits, totaled about 64% of its self-generated income. This was, from a budgetary perspective, a pretty healthy situation. . . .

And then it turns ugly:

Shortly afterwards, a proactive synergistic visionary named Laura Rosenbury became dean, and things started to change . . . a lot.

Rosenbury was obsessed with improving the law school’s ranking in the idiotic US News tables. Central to this vision was raising the LSAT and GPA scores of the law school’s entering students. In recent years, prospective law students have learned they can drive a hard bargain with schools like Florida, which were desperate to buy high LSAT scores in particular, because of the weight these numbers are given in the US News rankings formula. In order to improve these metrics Rosenbury had to convince the central administration to radically slash the effective tuition UF’s law students were paying . . .

The result was that, by fiscal year 2022, tuition revenue at the law school had fallen from $36.6 million to $8.3 million, in constant dollars — an astonishing 77% decline.

Rosenbury also increased the size of the full time faculty from 62 to 84, while slashing the size of the student body, with the result being that revenue per full time faculty member fell from around $645,000 to about $178,000. This meant that the school’s revenue was now not much more than half of its payroll, let alone the rest of its expenses.

Although Rosenbury managed to raise some money from donors while shilling her “vision” of a highly ranked law school, the net result was that, by the end of her deanship, the law school’s total self-generated revenue was covering just 40% of its operating costs, approximately.

By gaming the rankings in countless and in some instances ethically dubious ways — for example, she claimed that the school’s part time faculty expanded from 30 to 259 — the latter figure should be completely impossible under the ABA’s definitions of who can be counted as part time faculty — she did manage to raise the law school’s ranking quite a bit. This has not resulted in better job outcomes for Florida’s graduates relative to its local competitor schools, whose rankings didn’t improve, but it has allowed central administrators to advertise to their regents and legislators that UF now has a “top 25” law school . . .

And, now, the kicker. Campos quotes the New York Times:

Barnard College of Columbia University, one of the most prominent women’s colleges in the United States, announced on Thursday that it had chosen Laura A. Rosenbury, the dean of the University of Florida Levin College of Law, to serve as its next president.

Ms. Rosenbury became the first woman to serve as dean of Levin College of Law, in Gainesville, Fla., in 2015. She also taught classes in feminist legal theory, employment discrimination and family law. . . .

Columbia University hiring an administrator who supplied dubious data to raise a college’s U.S. News rankings . . . there’s some precedent for that!

Campos is only supplying one perspective, though. To learn more I took a look at the above-linked NYT article, which supplies the following information about the recent administrative accomplishments of Prof. Rosenbury, the new president of Barnard:

Ms. Rosenbury oversaw a period of growth at the University of Florida, raising more than $100 million in donations, hiring 39 new faculty members and increasing the number of applicants by roughly 200 percent, Barnard said. . . .

“I have been able to continue to do my work while also moving the law school forward in really exciting ways with the support of the state,” [Rosenbury] said. “The state has really invested in higher education in Florida including in the law school, and that is how we have been able to raise our national profile.”

What’s interesting here is that these paragraphs, while phrased in a positive way, are completely consistent with Campos’s points:

1. “More than $100 million in donations”: A ton of money, but not enough to make up for the huge drop in revenue.

2. “Hiring 39 new faculty members”: An odd thing to do if you’re gonna reduce the number of students.

3. “Increasing the number of applicants by roughly 200 percent”: Not really a plus to have more applicants for fewer slots. I mean, sure, if more prospective students want to apply, go for it, but if the way you get there is by offering huge discounts, it’s not an amazing accomplishment.

4. “The state has really invested in higher education in Florida including in the law school”: If you go from a profit center to a money pit, and then the state needs to bail you out, then, yeah, the state is “investing” . . . kind of.

On the plus side, if Columbia has really been ejected from U.S. News, the new Barnard president won’t have any particular motivation to game the rankings. I do worry, though, that she’ll find a way to raise costs while reducing revenues.

Also, to be fair, budgets aren’t the only thing that a university president does. Perhaps Prof. Rosenbury’s intellectual contributions are important enough to be worth the financial hit. I have no idea and make no claim on this, one way or the other.

AI bus route fail: Typically the most important thing is not how you do the optimization but rather what you decide to optimize.

Robert Farley shares this amusing story of a city that contracted out the routing of its school buses to a company that “uses artificial intelligence to generate the routes with the intent of reducing the number of routes. Last year, JCPS had 730 routes, and that was cut to 600 beginning this year . . .” The result was reported to be a “transportation disaster.”

I don’t know if you can blame AI here . . . Reducing the number of routes by over 15%, that’s gonna be a major problem! To first approximation we might expect routes to be over 15% longer, but that’s just an average: you can bet it will be much worse for some routes. No surprise that the bus drivers hate it.

As Farley says, “In theory, developing a bus route algorithm is something that AI could do well . . . [to] optimize the incredibly difficult problem of getting thousands of kids to over 150 schools in tight time windows,” but:

1. Effective problem solving for the real world requires feedback, and it’s not clear that any feedback was involved in this system: the company might have just taken the contract, run their program, and sent the output to the school district without ever checking that the results made sense, not to mention getting feedback from bus drivers and school administrators. I wonder how many people at the company take the bus to work themselves every day!

2. It sounds like the goal was to reduce the number of routes, not to produce routes that worked. If you optimize on factor A, you can pay big on factor B. Again, this is a reason for getting feedback and solving the problem iteratively.

3. Farley describes the AI solution as “high modernist thinking.” That’s a funny and insightful way to put it! I have no idea what sort of “artificial intelligence” was used in this bus routing program. It’s an optimization problem, and typically the most important thing is not how you do the optimization but rather what you decide to optimize.
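
As a toy illustration of points 2 and 3 (invented numbers, nothing to do with the actual routing software): pack the same set of stops into routes under different caps on route length, and watch how squeezing down the number of routes necessarily pushes up the worst route times.

```python
# Toy illustration of "optimize factor A, pay big on factor B": fewer routes
# can only be achieved by making individual routes longer. Numbers are made up.
import numpy as np

rng = np.random.default_rng(0)
stop_minutes = rng.uniform(2, 10, size=3000)   # time each stop adds to a route

def pack(stops, max_route_minutes):
    """Greedy packing: start a new route whenever the current one is full."""
    routes, current = [], 0.0
    for s in stops:
        if current + s > max_route_minutes:
            routes.append(current)
            current = 0.0
        current += s
    routes.append(current)
    return routes

for cap in (60, 75, 90):   # looser caps -> fewer routes, longer worst routes
    routes = pack(stop_minutes, cap)
    print(f"cap {cap} min: {len(routes)} routes, longest {max(routes):.0f} min")
```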

In that sense, the biggest problem with “AI” here is not that it led to a bad solution—if you try to optimize the wrong thing, I’d guess that any algorithm not backed up by feedback will fail—but rather that it had an air of magic which led people to accept its results unquestioningly. “AI,” like “Bayesian,” can serve as a slogan that leads people to turn off their skepticism. They might as well have said they used quantum computing or room-temperature superconductors or whatever.

I guess the connection to “high modernist thinking” is (a) the idea that we can and should replace the old with the new, clear the “slums” and build clean shiny new buildings, etc., and (b) the idea of looking only at surfaces, kinda like how Theranos conned lots of people by building fake machines that looked like clean Apple-brand devices. In this case, I have no reason to think the bus routing program is a con; it sounds more like an optimization program plus good marketing, and this was just one more poorly-planned corporate/government contract, with “AI” just providing a plausible cover story.

Hey, check this out! Here’s how to read and then rewrite the title and abstract of a paper.

In our statistical communication class today, we were talking about writing. At some point a student asked why it was that journal articles are all written in the same way. I said, No, actually there are many different ways to write a scientific journal article. Superficially these articles all look the same: title, abstract, introduction, methods, results, discussion, or some version of that, but if you look in detail you’ll see that you have lots of flexibility in how to do this (with the exception of papers in medical journals such as JAMA which indeed have a pretty rigid format).

The next step was to demonstrate the point by going to a recent scientific article. I asked the students to pick a journal. Someone suggested NBER. So I googled NBER and went to its home page:

I then clicked on the most recent research paper, which was listed on the main page as “Employer Violations of Minimum Wage Laws.” Click on the link and you get this more dramatically-titled article:

Does Wage Theft Vary by Demographic Group? Evidence from Minimum Wage Increases

with this abstract:

Using Current Population Survey data, we assess whether and to what extent the burden of wage theft — wage payments below the statutory minimum wage — falls disproportionately on various demographic groups following minimum wage increases. For most racial and ethnic groups at most ages we find that underpayment rises similarly as a fraction of realized wage gains in the wake of minimum wage increases. We also present evidence that the burden of underpayment falls disproportionately on relatively young African American workers and that underpayment increases more for Hispanic workers among the full working-age population.

We actually never got to the full article (but feel free to click on the link and read it yourself). There was enough in the title and abstract to sustain a class discussion.

Before going on . . .

In class we discussed the title and abstract of the above article and considered how it could be improved. This does not mean we think the article, or its title, or its abstract, is bad. Just about everything can be improved! Criticism is an important step in the process of improvement.

The title

“Does Wage Theft Vary by Demographic Group? Evidence from Minimum Wage Increases” . . . that’s not bad! “Wage Theft” in the first sentence is dramatic—it grabs our attention right away. And the second sentence is good too: it foregrounds “Evidence” and it also tells you where the identification is coming from. So, good job. We’ll talk later about how we might be able to do even better, but I like what they’ve got so far.

Just two things.

First, the answer to the question, “Does X vary with Y?”, is always Yes. At least, in social science it’s always Yes. There are no true zeroes. So it would be better to change that first sentence to something like, “How Does Wage Theft Vary by Demographic Group?”

The second thing is the term “wage theft.” I took that as a left-wing signifier, the same way in which the use of a loaded term such as “pro-choice” or “pro-life” conveys the speaker’s position on abortion. So I took the use of that phrase in the title as a signal that the article is taking a position on the political/economic left. But then I googled the first author, and . . . he’s an “Adjunct Senior Fellow at the Hoover Institution.” Not that everyone at Hoover is right-wing, but it’s not a place I associate with the left, either. So I’ll move on and not worry about this issue.

The point here is not that I’m trying to monitor the ideology of economics papers. This is a post on how to write a scholarly paper! My point is that the title conveys information, both directly and indirectly. The term “wage theft” in the title conveys that the topic of the paper will be morally serious—they’re talking about “theft,” not just some technical violations of a law—; also it has this political connotation. When titling your papers, be aware of the direct and indirect messages you’re conveying.

The abstract

As I said, I liked the title of the paper—it’s punchy and clear. The abstract is another story. I read it and then realized I hadn’t absorbed any of its content, so I read it again, and it was still confusing. It’s not “word salad”—there’s content in that abstract—; it’s just put together in a way that I found hard to follow. The students in the class had the same impression, and indeed they were kinda relieved that I too found it confusing.

How to rewrite? The best approach would be to go into the main paper, maybe start with our tactic of forming an abstract by taking the first sentence of each of the first five paragraphs. But here we’ll keep it simple and just go with the information right there in the current abstract. Our goal is to rewrite in a way that makes it less exhausting to read.

Our strategy: First take the abstract apart, then put it back together.

I went to the blackboard and listed the information that was in the abstract:
– CPS data
– Definition of wage theft
– What happens after minimum wage increase
– Working-age population
– African American, Hispanic, White

Now, how to put this all together? My first thought was to just start with the definition of wage theft, but then I checked online and learned that the phrase used in the abstract, “wage payments below the statutory minimum wage,” is not the definition of wage theft; it’s actually just one of several kinds of wage theft. So that wasn’t going to work. Then there’s the bit from the abstract, “falls disproportionately on various demographic groups”—that’s pretty useless, as what we want to know is where this disproportionate burden falls, and by how much.

Putting it all together

We discussed some more—it took surprisingly long, maybe 20 minutes of class time to work through all these issues—and then I came up with this new title/abstract:

Wage theft! Evidence from minimum wage increases

Using Current Population Survey data from [years] in periods following minimum wage increases, we look at the proportion of workers being paid less than the statutory minimum, comparing different age groups and ethnic groups. This proportion was highest in ** age and ** ethnic groups.

OK, how is this different from the original?

1. The three key points of the paper are “wage theft,” “evidence,” and “minimum wage increases,” so that’s now what’s in the title.

2. It’s good to know that the data came from the Current Population Survey. We also want to know when this was all happening, so we added the years to the abstract. Also we made the correction of changing the tense in the abstract from the present to the past, because the study is all based on past data.

3. The killer phrase, “wage theft,” is already in the title, so we don’t need it in the abstract. That helps, because then we can use the authors’ clear and descriptive phrase, “the proportion of workers being paid less than the statutory minimum,” without having to misleadingly imply that this is the definition of wage theft, and without having to lugubriously state that it’s a kind of wage theft. That was so easy!

4. We just say we’re comparing different age and ethnic groups and then report the results. This to me is much cleaner than the original abstract which shared this information in three long sentences, with quite a bit of repetition.

5. We have the ** in the last sentence because I’m not quite clear from the abstract what are the take-home points. The version we created is short enough that we could add more numbers to that last sentence, or break it up into two crisp sentences, for example, one sentence about age groups and one about ethnic groups.

In any case, I think this new version is much more readable. It’s a structure much better suited to conveying, not just the general vibe of the paper (wage theft, inequality, minority groups) but the specific findings.

Lessons for rewriters

Just about every writer is a rewriter. So these lessons are important.

We were able to improve the title and abstract, but it wasn’t easy, nor was it algorithmic—that is, there was no simple set of steps to follow. We gave ourselves the relatively simple task of rewriting without the burden of subject-matter knowledge, and it still took a half hour of work.

After looking over some writing advice, it’s tempting to think that rewriting is mostly a matter of a few clean steps: replacing the passive with the active voice, removing empty words and phrases such as “quite” and “Note that,” checking for grammar, keeping sentences short, etc. In this case, no. In this case, we needed to dig in a bit and gain some conceptual understanding to figure out what to say.

The outcome, though, is positive. You can do this too, for your own papers!

The problem with p-values is how they’re used

The above-titled article is from 2014. Key passage:

Hypothesis testing and p-values are so compelling in that they fit in so well with the Popperian model in which science advances via refutation of hypotheses. For both theoretical and practical reasons I am supportive of a (modified) Popperian philosophy of science in which models are advanced and then refuted. But a necessary part of falsificationism is that the models being rejected are worthy of consideration. If a group of researchers in some scientific field develops an interesting scientific model with predictive power, then I think it very appropriate to use this model for inference and to check it rigorously, eventually abandoning it and replacing it with something better if it fails to make accurate predictions in a definitive series of experiments. This is the form of hypothesis testing and falsification that is valuable to me. In common practice, however, the “null hypothesis” is a straw man that exists only to be rejected. In this case, I am typically much more interested in the size of the effect, its persistence, and how it varies across different situations. I would like to reserve hypothesis testing for the exploration of serious hypotheses and not as an indirect form of statistical inference that typically has the effect of reducing scientific explorations to yes/no conclusions.

The logical followup is that article I wrote the other day, Before data analysis: Additional recommendations for designing experiments to learn about the world.

But the real reason I’m bringing up this old paper is to link to this fun discussion revolving around how the article never appeared in the journal that invited it, because I found out they wanted to charge me $300 to publish it, and I preferred to just post it for free. (OK, not completely free; it does cost something to maintain these sites, but the cost is orders of magnitude less than $300 for 115 kilobytes of content.)

We were gonna submit something to Nature Communications, but then we found out they were charging $6290 for publication. For that amount of money, we could afford 37% of an invitation to a conference featuring Grover Norquist, Gray Davis, and a rabbi, or 1/160th of the naming rights for a sleep center at the University of California, or 4735 Jamaican beef patties.

My colleague and I wrote a paper, and someone suggested we submit it to the journal Nature Communications. Sounds fine, right? But then we noticed this:

Hey! We wrote the damn article, right? They should be paying us to publish it, not the other way around. Ok, processing fees yeah yeah, but $6290??? How much labor could it possibly take to publish one article? This makes no damn sense at all. I guess part of that $6290 goes to paying for that stupid website where they try to con you into paying several thousand dollars to put an article on their website that you can put on Arxiv for free.

Ok, then the question arises: What else could we get for that $6290? A trawl through the blog archive gives some possibilities:

– 37% of an invitation to a conference featuring Grover Norquist, Gray Davis, and a rabbi

– 1/160th of the naming rights for a sleep center at the University of California

– 4735 Jamaican beef patties

I guess that, among all these options, the Nature Communications publication would do the least damage to my heart. Still, I couldn’t quite bring myself to commit to forking over $6290. So we’re sending the paper elsewhere.

At this point I’m still torn between the other three options. 4735 Jamaican beef patties sounds good, but 1/160th of a sleep center named just for me, that would be pretty cool. And 37% of a chance to meet Grover Norquist, Gray Davis, and a rabbi . . . that’s gotta be the most fun since Henry Kissinger’s 100th birthday party. (Unfortunately I was out of town for that one, but I made good use of my invite: I forwarded it to Kissinger superfan Cass Sunstein, and it seems he had a good time, so nothing was wasted.) So don’t worry, that $6290 will go to a good cause, one way or another.

Springboards to overconfidence: How can we avoid . . .? (following up on our discussion of synthetic controls analysis)

Following up on our recent discussion of synthetic control analysis for causal inference, Alberto Abadie points to this article from 2021, Using Synthetic Controls: Feasibility, Data Requirements, and Methodological Aspects.

Abadie’s paper is very helpful in that it lays out the key assumptions and decision points, which can help us better understand what went so wrong in the paper on Philadelphia crime rates that we discussed in my earlier post.

I think it’s a general concern with methods papers (mine included!) that we tend to focus more on examples where the method works well than on examples where it doesn’t. Abadie’s paper has an advantage over mine in that he gives conditions under which a method will work, and it’s not his fault that researchers then use the methods and get bad answers.

Regarding the specific methods issue, of course there are limits to what can be learned from N=1 treated units, whether analyzed using synthetic control or any other approach. It seems that researchers sometimes lose track of that point in their desire to make strong statements. On a very technical level, I suspect that, if researchers are using a weighted average as a comparison, they’d do better using some regularization rather than just averaging over a very small number of other cases. But I don’t think that would help much in the particular application we were discussing on the blog.
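Just to illustrate what I mean by regularization there: here’s a minimal sketch in Python, with made-up numbers rather than any real homicide data, and it’s not the procedure used in the paper or in Abadie’s work. It just shows how adding a ridge penalty to a constrained least-squares fit of the comparison weights pulls the weights toward an even spread across donor cities; setting the penalty to zero gives the plain constrained fit.

import numpy as np
from scipy.optimize import minimize

# Made-up pre-period data: 5 years of counts for 10 hypothetical donor cities
# and for one treated city. Illustration only, not the real data.
rng = np.random.default_rng(0)
Y0 = rng.poisson(300, size=(5, 10)).astype(float)  # donor cities (years x cities)
y1 = rng.poisson(300, size=5).astype(float)        # treated city

def fit_weights(lam):
    # Minimize ||y1 - Y0 w||^2 + lam * ||w||^2 with w >= 0 and sum(w) = 1.
    k = Y0.shape[1]
    objective = lambda w: np.sum((y1 - Y0 @ w) ** 2) + lam * np.sum(w ** 2)
    constraints = [{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}]
    result = minimize(objective, np.full(k, 1.0 / k), method="SLSQP",
                      bounds=[(0.0, 1.0)] * k, constraints=constraints)
    return result.x

print(np.round(fit_weights(lam=0.0), 2))  # plain constrained fit
print(np.round(fit_weights(lam=1e4), 2))  # large ridge penalty: weights pulled toward an even spread

The trade-off, of course, is that the penalized weights fit the pre-period data less closely.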

The deeper problem

The question is, when scholars such as Abadie write such clear descriptions of a method, including all its assumptions, how is it that applied researchers such as the authors of that Philadelphia article make such a mess of things? The problem is not unique to synthetic control analysis; it also arises with other “identification strategies” such as regression discontinuity, instrumental variables, linear regression, and plain old randomized experimentation. In all these cases, researchers often seem to end up using the identification strategy not as a tool for learning from data but rather as a sort of springboard to overconfidence. Beyond causal inference, there are all the well-known misapplications of Bayesian inference and classical p-values. No method is safe.

So, again, nothing special about synthetic control analysis. But what did happen in the example that got this discussion started? To quote from the original article:

The research question here is whether the application of a de-prosecution policy has an effect on the number of homicides for large cities in the United States. Philadelphia presents a natural experiment to examine this question. During 2010–2014, the Philadelphia District Attorney’s Office maintained a consistent and robust number of prosecutions and sentencings. During 2015–2019, the office engaged in a systematic policy of de-prosecution for both felony and misdemeanor cases. . . . Philadelphia experienced a concurrent and historically large increase in homicides.

After looking at the time series, here’s my quick summary: Philadelphia’s homicide rate went up after 2014, during the same period in which the city cut back on prosecutions, and this was part of a national trend of increased homicides—but there’s no easy way, given the directly available information, to compare to other cities with and without that policy.

I’ll refer you to my earlier post and its comment thread for more on the details.

At this point, the authors of the original article used a synthetic controls analysis, following the general approach described in the Abadie paper. The comparisons they make are to that weighted average of Detroit, New Orleans, and New York. The trouble is . . . that’s just 3 cities, and homicide rates can vary a lot from city to city. There’s no good reason to think that an average of three cities that gives you numbers comparable to Philadelphia’s homicide rates or counts in the five previous years will give you a reasonable counterfactual for trends in the next five years. Beyond this, some outside researchers pointed out many forking paths in the published analysis. Forking paths are not in themselves a problem (my own applied work is full of un-preregistered data coding and analysis decisions); the relevance here is that they help explain how it’s possible for researchers to get apparently “statistically significant” results from noisy data.

So what went wrong? Abadie’s paper discusses a mathematical problem: if you want to compare Philadelphia to some weighted average of the other 96 cities, and if you want these weights to be positive, to sum to 1, and to be estimated using an otherwise unregularized procedure, then the resulting procedure has certain statistical properties and, in this case, given various analytic decisions, leads to choosing a particular weighted average of Detroit, New Orleans, and New York. There’s nothing wrong with doing this, but, ultimately, all you have is a comparison of 1 city to 3 cities, and it’s completely legit from an applied perspective to look at these cities and recognize how different they all are.
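To spell out what that constrained fit is, in my own shorthand (the symbols below are just notation for this discussion, not Abadie’s, and I’m leaving out the extra predictors and tuning choices a full synthetic control analysis can include), the procedure is roughly

\[
\hat{w} \;=\; \arg\min_{w}\ \bigl\| X_{\mathrm{Philly}} - X_0\, w \bigr\|^2
\qquad \text{subject to } w_j \ge 0 \ \text{for all } j,\quad \sum_{j=1}^{96} w_j = 1,
\]

where \(X_{\mathrm{Philly}}\) is Philadelphia’s vector of pre-period homicide counts, \(X_0\) collects the corresponding counts for the 96 candidate comparison cities, and the fitted \(\hat{w}\) is the set of weights that, in this case, ended up concentrated on Detroit, New Orleans, and New York.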

It’s not the fault of the synthetic control analysis if you have N=1 in the treatment group. It’s just the way things go. The error is to use that analysis to make strong claims, and the further error is to think that the use of this particular method—or any particular method—should insulate the analysis from concerns about reasonableness. If you want to compare one city to 96 others, then your analysis will rely on assumptions about comparability of the different cities, and not just on one particular summary such as the homicide counts during a five-year period.

You can say that this general concern arises with linear regression as well—you’re only adjusting for whatever pre-treatment variables are included in the model. For example, when we estimated the incumbency advantage in congressional elections by comparing elections with incumbents running for reelection to elections in open seats, adjusting for previous vote share and party control, it would be a fair criticism to say that maybe the treatment and control cases differed in other important ways not included in the analysis. And we looked at that! I’m not saying our analysis was perfect; indeed, a decade and a half later we reanalyzed the data with a measurement-error model and got what we think were improved results. It was a big help that we had replication: many years, and many open-seat and incumbent elections in each year. This Philadelphia analysis is different because it’s N=1. If we tried to do linear regression with N=1, we’d have all sorts of problems. Unfortunately, the synthetic control analysis did not resolve the N=1 problem—it’s not supposed to!—but it did seem to lead the authors into some strong claims that did not make a lot of sense.
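For concreteness, here’s a minimal sketch of that kind of regression adjustment, in Python with simulated data and hypothetical variable names; it is not our actual incumbency analysis, just the general form of the comparison described above.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for the election data: one row per district-year,
# with made-up relationships. Variable names are hypothetical.
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "incumbent_running": rng.integers(0, 2, n),
    "prev_vote_share": rng.uniform(0.3, 0.7, n),
    "party": rng.choice([-1, 1], n),
})
df["vote_share"] = (0.10 + 0.05 * df["incumbent_running"]
                    + 0.80 * df["prev_vote_share"]
                    + 0.02 * df["party"]
                    + rng.normal(0, 0.05, n))

# Compare incumbent-running elections to open seats, adjusting for
# previous vote share and party control.
fit = smf.ols("vote_share ~ incumbent_running + prev_vote_share + party", data=df).fit()
print(fit.params["incumbent_running"])  # the adjusted incumbency comparison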

P.S. I sent the above to Abadie, who added:

I would like to share a couple of thoughts about N=1 and whether it is good or bad to have a small number of units in the comparison group.

Synthetic controls were originally proposed to address the N=1 (or low N) setting in cases with aggregate and relatively noiseless data and strong co-movement across units. I agree with you that they do not mechanically solve the N=1 problem in general (and that nothing does!). They have to be applied with care, and there will be settings where they do not produce credible estimates (e.g., noisy series, short pre-intervention windows, poor pre-intervention fit, poor prediction in hold-out pre-intervention windows, etc.). There are checks (e.g., predictive power in hold-out pre-intervention windows) that help assess the credibility of synthetic control estimates in applied settings.

Whether a few controls or many controls are better depends on the context of the investigation and on what one is trying to attain. Precision may call for using many comparisons. But there is a trade-off. The more units we use as comparisons, the less similar those may be relative to the treated unit. And the use of a small number of units allows us to evaluate / correct for potential biases created by idiosyncratic shocks and / or interference effects on the comparison units. If the aggregate series are “noiseless enough” like in the synthetic control setting, one might care more about reducing bias than about attaining additional precision.
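To make the hold-out idea concrete, here’s a minimal sketch, with made-up numbers rather than any real city data, of the kind of hold-out pre-intervention check Abadie mentions: fit the comparison weights on the earlier pre-intervention years, then see how well the weighted comparison tracks the treated unit in pre-intervention years that were held out of the fit.

import numpy as np
from scipy.optimize import minimize

# Made-up pre-intervention data: 8 years of counts for 20 hypothetical donor
# cities and for one treated city. Illustration only.
rng = np.random.default_rng(2)
Y0 = rng.poisson(300, size=(8, 20)).astype(float)  # donors (years x cities)
y1 = rng.poisson(300, size=8).astype(float)        # treated unit

train, hold = slice(0, 5), slice(5, 8)  # fit on the first 5 years, check on the last 3

def simplex_fit(Y, y):
    # Weights constrained to be nonnegative and to sum to 1.
    k = Y.shape[1]
    constraints = [{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}]
    result = minimize(lambda w: np.sum((y - Y @ w) ** 2), np.full(k, 1.0 / k),
                      method="SLSQP", bounds=[(0.0, 1.0)] * k, constraints=constraints)
    return result.x

w = simplex_fit(Y0[train], y1[train])
gap = y1[hold] - Y0[hold] @ w
print(np.round(gap, 1))  # if the held-out gaps are already large, don't lean on the post-period gap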