Polling without paying respondents: The tragedy—or, in this case, comedy—of the commons

Mark Evanier is a great storyteller. Here’s his story from 2011 about the unpromising topic of cold-call marketing surveys:

I [Evanier] don’t like salespersons or survey-takers who phone me. . . . I especially don’t want to answer questions from survey-takers because I figure they’re calling to build up a profile on me…and that profile will be used somehow to try and sell me things I don’t want.

Sometimes, I immediately ask the caller, “Is the last question about how much money this household makes?” Because they always want to know that and they usually hide it at the end. I’m not going to answer their questions either way but if that question’s in there, I’m especially not going to answer their questions.

The other day, a lady phoned and told me she was conducting a “brief survey” and would just need a few minutes of my time. Before I could ask her about the last question, she said, “For every survey that is completed, a donation will be made to the Susan G. Komen Breast Cancer Foundation”…which I believe is no longer even the name of that organization. So instantly I suspect they might not really be dealing in any way with the Foundation or maybe they’re keeping in the word “cancer” to ratchet up the sympathy. I asked the lady, “How much?” and from there on, it went pretty much like this. To her credit, she started giggling about halfway through . . .

You can click through to read the whole thing. The quick summary is: (a) it doesn’t sound like much of a donation is going on, and (b) the last question on the survey was indeed how much his household makes.

As Evanier says, this is not the fault of the survey interviewer (soon to be replaced by a robot, if that has not already happened). At the same time, it’s not Evanier’s fault that his time got wasted on this call. Ok, in this case not wasted because he got a good story out of it, which I’m now sharing with you. But, yeah, most of the time, it’s time wasted.

These things are scams on multiple levels. They con you into spending 45 minutes of your time giving them free data, then they con the companies that hire them into thinking they’re providing useful marketing intel. In this case there was the extra con of pretending they’d be meaningfully supporting a charity. The big problem is that they want your survey responses but they don’t want to pay for them.

I wrote about this in one of our very first blog posts, back in 2004:

The U.S. is over-polled. You might have noticed this during the recent election campaign when national polls were performed roughly every 2 seconds. . . .

My complaint is not new, but this recent campaign was particularly irritating because it became commonplace for people to average batches of polls to get more accurate estimators. As news consumers, we’re like gluttons stuffing our faces with 5 potato chips at a time, just grabbing them out of the bag.

In recent years, as polling has proliferated, response rates have been going down. Why bother responding at all? Bob Groves and others have done research in this area. One reason to respond is to be helpful and civic-minded.

The recent proliferation of polls—whether for marketing or just to sell newspapers—exploits people’s civic-mindedness. Polling and polling and polling until all the potential respondents get tired—it’s like draining the aquifer to grow alfalfa in the desert, or dredging all the crabs out of the bay—a short-sighted squandering of a resource that should be renewable.

I’d forgotten that poll averaging was already a thing back in 2004.

Freakonomics and global warming: What happens to a team of “rogues” when there is no longer a stable center to push against? (a general problem with edgelords)

A few years ago there was a cottage industry among some contrarian journalists, making use of the fact that 1998 was a particularly hot year (by the standards of its period) to cast doubt on the global warming trend. Ummmm, where did I see this? . . . Here, I found it! It was a post by Stephen Dubner on the Freakonomics blog, entitled, “A Headline That Will Make Global-Warming Activists Apoplectic,” and continuing:

The BBC is responsible. The article, by the climate correspondent Paul Hudson, is called “What Happened to Global Warming?” Highlights:

For the last 11 years we have not observed any increase in global temperatures. And our climate models did not forecast it, even though man-made carbon dioxide, the gas thought to be responsible for warming our planet, has continued to rise. So what on Earth is going on?

And:

According to research conducted by Professor Don Easterbrook from Western Washington University last November, the oceans and global temperatures are correlated. . . . Professor Easterbrook says: “The PDO cool mode has replaced the warm mode in the Pacific Ocean, virtually assuring us of about 30 years of global cooling.”

Let the shouting begin. Will Paul Hudson be drummed out of the circle of environmental journalists? Look what happened here, when Al Gore was challenged by a particularly feisty questioner at a conference of environmental journalists.

We have a chapter in SuperFreakonomics about global warming and it too will likely produce a lot of shouting, name-calling, and accusations ranging from idiocy to venality. It is curious that the global-warming arena is so rife with shrillness and ridicule. Where does this shrillness come from? . . .

No shrillness here. Professor Don Easterbrook from Western Washington University seems to have screwed up his calculations somewhere, but that happens. And Dubner did not make this claim himself; he merely featured a news article that treated this particular guy like an expert. Actually, Dubner and his co-author Levitt also wrote, “we believe that rising global temperatures are a man-made phenomenon and that global warming is an important issue to solve,” so I could never quite figure out why in their blog he was highlighting an obscure scientist who was claiming that we were virtually assured of 30 years of cooling.

Anyway, we all make mistakes; what’s important is to learn from them. I hope Dubner and his Freakonomics colleagues learn from this particular prediction that went awry. Remember, back in 2009 when Dubner was writing about “A Headline That Will Make Global-Warming Activists Apoplectic,” and Don Easterbrook was “virtually assuring us of about 30 years of global cooling,” the actual climate-science experts were telling us that things would be getting hotter. The experts were pointing out that oft-repeated claims such as “For the last 11 years we have not observed any increase in global temperatures . . .” were pivoting off the single data point of 1998, but Dubner and Levitt didn’t want to hear it. Fiddling while the planet burns, one might say.

It’s not that the experts are always right, but it can make sense to listen to their reasoning instead of going on about apoplectic activists, feisty questioners, and shrillness.

Freakonomists getting outflanked

The media landscape has changed since 2005 (when the first edition of Freakonomics came out), 2009 (when they ran that ridiculous post pushing climate-change denial), and 2018 (when the above post appeared; I updated it in 2021 with further discussion, and here’s the news from 2023).

Back in the day, Steven Levitt was a “rogue economist,” a genial rebel who held a mix of political opinions (for example, in 2008 thinking Obama would be “the greatest president in history” while pooh-poohing concerns about recession at the time), along with some soft contrarianism (most notoriously claiming that drunk walking was worse than drunk driving, but also various little things like saying that voting in a presidential election is not so smart). Basically, he was positioning himself as being a little more playful and creative than the usual economics professor. A rogue relative to a stable norm.

I wonder how the Freakonomics team feels now, in an era of quasi-academic celebrities such as Dr. Oz and Jordan Peterson, and podcasters like Joe Rogan who push all sorts of conspiracy theories—not just nutty-but-hey-why-not ideas such as UFOs and space aliens but also more dangerous positions such as vaccine denial.

Being a contrarian’s all fun and games when you’re defining yourself relative to a reasonable center, maybe not so much when you’re surrounded by crazies.

For example, what were Levitt and Dubner thinking back in 2009 when they published that credulous article featuring an eccentric climate change denier? I can’t know what they were thinking, but I suspect it was something like: “Hey, this guy deserves a hearing. And, in any case, we’re stirring things up. Conversation and debate are good things. Those global-warming activists are so shrill. Let’s make them apoplectic—that’ll be fun!”

The point is, this was all taking place in a media environment where climate change denial was marginalized. So they could run ridiculous pieces like the above-linked post without being concerned about having bad effects. They were just joking around, taking the piss, setting up boring Al Gore as a foil for “a particularly feisty questioner,” promoting a fringe character such as Professor Don Easterbrook from Western Washington University (he who told us in 2009 that climatic conditions were “virtually assuring us of about 30 years of global cooling”), secure in the belief that no one would take this claim seriously. Just a poke in the eye at humorless liberals, that’s all.

Recall that, around the same time, Levitt and Dubner also wrote, “we believe that rising global temperatures are a man-made phenomenon and that global warming is an important issue to solve” (see also here) so my take on the whole episode is that they felt ok promoting a fringe climate-change denier without concern that they could be upsetting the larger consensus. They got to have the fun of being edgy by promoting the prediction of “30 years of global cooling” without ever actually believing that ridiculous claim.

Nowadays, though, things are getting out of control, both with the climate and with extremists and wild takes in news and social media and in politics, and I imagine that it’s more difficult for the Freakonomics team to feel comfortable as rogues. They no longer have a stable center to push against.

A political science perspective

In political science we sometimes talk about proximity or directional voting. In proximity voting, you choose the party or candidate closest to you in policy preferences; in directional voting, you choose the party or candidate whose position is most extreme relative to the center while being in the same general direction as yours (to be precise, if we consider your position and each party’s position as vectors in a multidimensional space, you’d choose the party that maximizes the dot product of your position and the party’s position, with that dot product being defined relative to some zero position in the center of the political spectrum). The rationale for proximity voting is obvious; the rationale for directional voting is that your vote has only a very small impact, which you can maximize by pushing the polity as far as you can in the desired direction.
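To make the contrast concrete, here’s a toy sketch in Python; the voter and party positions are made up for illustration:

```python
import numpy as np

# Hypothetical positions in a two-dimensional issue space, with the origin
# taken as the center of the political spectrum (all numbers made up).
voter = np.array([0.3, 0.1])
parties = {
    "center-left":  np.array([-0.2, 0.2]),
    "center-right": np.array([0.4, 0.3]),
    "far-right":    np.array([1.5, 1.2]),
}

# Proximity voting: choose the party closest to the voter's position.
proximity_choice = min(parties, key=lambda p: np.linalg.norm(voter - parties[p]))

# Directional voting: choose the party that maximizes the dot product with
# the voter's position, both measured relative to the center at the origin.
directional_choice = max(parties, key=lambda p: float(voter @ parties[p]))

print("proximity choice:  ", proximity_choice)    # center-right
print("directional choice:", directional_choice)  # far-right
```

Here a mildly right-of-center voter picks the nearby moderate party under proximity voting but the extreme party under directional voting, because the extreme party’s vector points farther in roughly the same direction.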

There is a logic to directional voting; the problem arises when many people do it, with the result that extreme parties get real influence and even attain political power in a country.

Some examples of directional voting, or directional position-taking, include Levitt and Dubner pushing climate-change denial, people who should know better on the right supporting election denial in 2020, or, on the other side, center-leftists supporting police defunding, presumably following the reasoning that the police would not be defunded and the pressure to defund would merely cause police funding to decrease. Once you think to look, you can find this sort of political behavior all the time: a way to oppose the party in power is to support its fiercest opponents, even if you would not ever want those opponents to be in power either.

But . . . directional voting falls apart when the center does not hold.

What is a standard error?

I spoke at a session with the above title at the American Economic Association meeting a few months ago. It was organized by Serena Ng and Elie Tamer, and the other talks were given by Patrick Kline, James Powell, Jeffrey Wooldridge, and Bin Yu. In addition to speaking, Bin and I wrote short papers that will appear in the Journal of Econometrics. Here’s mine:

What is a standard error?

In statistics, the standard error has a clear technical definition: it is the estimated standard deviation of a parameter estimate. In practice, though, challenges arise when we go beyond the simple balls-in-urn model to consider generalizations beyond the population from which the data were sampled. This is important because generalization is nearly always the goal of quantitative studies. In this brief paper we consider three examples.

What is the standard error when the bias is unknown and changing (my bathroom scale)?

I recently bought a cheap bathroom scale. I took the scale home and zeroed it—there’s a little gear in front to turn. I tapped my foot on the scale, it went to -1 kg, I turned the gear a bit, then it went up to +2, then I turned a bit back to get it exactly to zero, and tapped again . . . it was back at -1. That was frustrating, but I still wanted to estimate my weight. So I got on and off the scale multiple times. The first few measurements were 66 kg, 65.5 kg, 68 kg, and 67 kg. A lot of variation! To get a good estimate in the presence of variation, it is recommended to take multiple measurements. So I did so. After 46 measurements, I got bored and stopped. The resulting measurements had mean 67.1 with standard deviation 0.7, hence a standard error of 0.7/sqrt(46) = 0.1.

Would I want to use the resulting 95% confidence interval, 67.1 +/- 0.2? Of course not! The whole scale is off by some unknown amount. What, then, to do? One approach would be to calibrate, either using a known object that weighs in the neighborhood of 67 kg or else my own weight measured on an accurate instrument. If that is not possible, then I would want a wider uncertainty interval to account for the uncertainty in the scale’s bias. The usual purpose of a standard error is to attach uncertainty to an estimate, and for that purpose, the usual standard error formula is inappropriate.
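Here’s a minimal sketch of both calculations, using simulated measurements in place of the actual 46 readings and an assumed standard deviation of 1 kg for the scale’s bias:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the 46 repeated weighings (kg); the real measurements had
# mean 67.1 and standard deviation 0.7.
measurements = rng.normal(loc=67.1, scale=0.7, size=46)

n = len(measurements)
mean = measurements.mean()
se_sampling = measurements.std(ddof=1) / np.sqrt(n)      # about 0.1 kg

# Naive 95% interval: fine for the mean of repeated readings on this scale,
# but it ignores the scale's unknown bias.
naive_ci = (mean - 1.96 * se_sampling, mean + 1.96 * se_sampling)

# If we put, say, a 1 kg standard deviation on the scale's bias (an assumed
# number, purely for illustration), the relevant uncertainty is dominated
# by the bias term, not the sampling term.
sd_bias = 1.0
se_total = np.sqrt(se_sampling**2 + sd_bias**2)
wider_ci = (mean - 1.96 * se_total, mean + 1.96 * se_total)

print(f"mean {mean:.1f} kg, sampling SE {se_sampling:.2f} kg")
print(f"naive 95% interval: {naive_ci[0]:.1f} to {naive_ci[1]:.1f}")
print(f"bias-widened interval: {wider_ci[0]:.1f} to {wider_ci[1]:.1f}")
```

The point of the sketch is just that once the bias term is acknowledged, the sampling-based standard error contributes almost nothing to the total uncertainty.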

How do you interpret standard errors from a regression fit to the entire population (all 50 states)?

Sometimes we can all agree that if you have a whole population, your standard error is zero. This is basic finite population inference from survey sampling theory, if your goal is to estimate the population average or total. Consider a regression fit to data on all 50 states in the United States. This gives you an estimate and a standard error. Maybe the estimated coefficient of interest is only one standard error from zero, so it’s not “statistically significant.” But what does that mean, if you have the whole population? You might say that the standard error doesn’t matter, but the internal variation in the data still seems relevant, no?

One way to think about this is to imagine the regression being used for prediction. For example, you have all 50 states, but you might use the model to understand these states in a different year. So you can think of the data you have from the 50 states as being a sample from a larger population of state-years. It’s not a random or representative sample, though, in that it’s data from just one year. So to get the right uncertainty you’ll need to use a multilevel model or clustered standard errors. With data from only one cluster, some external assumptions will be needed to compute the standard error. Alternatively, one could just use the standard error that pops out of the regression, which would correspond to an implicit model of equal variation between and within years. Just because you have an exhaustive sample, that does not mean that the standard error is undefined or meaningless.
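Here’s a toy simulation, with made-up numbers, of why the single-year regression standard error can understate the uncertainty that matters when the goal is to generalize to other years:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_years = 50, 20

x = rng.normal(size=n_states)             # state-level predictor, fixed across years
beta_mean, beta_sd_between = 0.5, 0.3     # the "true" coefficient drifts year to year

classical_se, beta_hat = [], []
for _ in range(n_years):
    beta_t = rng.normal(beta_mean, beta_sd_between)     # this year's coefficient
    y = beta_t * x + rng.normal(scale=1.0, size=n_states)
    X = np.column_stack([np.ones(n_states), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (n_states - 2)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    beta_hat.append(coef[1])
    classical_se.append(np.sqrt(cov[1, 1]))

print("average classical (single-year) SE:", round(np.mean(classical_se), 2))
print("sd of estimates across years:      ", round(np.std(beta_hat, ddof=1), 2))
# The second number is larger: the regression SE from one year's data reflects
# only within-year variation, not the year-to-year variation that matters if
# the model will be used to understand a different year.
```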

How should we account for nonsampling error when reporting uncertainties (election polls)?

In an analysis of state-level pre-election polls, we have found the standard deviation of empirical errors—the difference between the poll estimate and the election outcome—to be about twice as large as would be expected from the reported standard errors of the individual surveys. This sort of nonsampling error is usual in polling; what is special about election forecasting is that here we can observe the outcome and thus measure the total error directly. The question then arises: what standard error should a pollster report? The usual formula based on sampling balls from an urn (with some correction for weighting or survey adjustment) gives an internal measure of uncertainty but does not address the forecasting question. It would seem better to augment the standard error based on past levels of nonsampling error, but then the question arises of what to do in other sampling settings where no past calibration is available. In election polling we have some sense of that extra uncertainty; it seems wrong to implicitly set it to zero when we don’t know what to do about it.
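As a rough sketch of the kind of adjustment I have in mind, here’s a toy calculation (the poll numbers are made up) that inflates the reported sampling error using the empirical finding that total error has been about twice the nominal error:

```python
import numpy as np

# Toy poll: 800 respondents, 52% support (made-up numbers).
n, p_hat = 800, 0.52
se_sampling = np.sqrt(p_hat * (1 - p_hat) / n)            # about 0.018

# If, as in the analysis above, total error in past polls has been roughly
# twice the nominal sampling error, the implied nonsampling component is
# about sqrt(2**2 - 1), or roughly 1.7 times, the sampling SE.
se_nonsampling = np.sqrt(2**2 - 1) * se_sampling
se_total = np.sqrt(se_sampling**2 + se_nonsampling**2)    # about 2 * se_sampling

print(f"nominal sampling SE: {se_sampling:.3f}")
print(f"error-calibrated SE: {se_total:.3f}")
print(f"95% margin with the wider SE: +/- {1.96 * se_total:.3f}")
```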

How, then, should we interpret the standard error from textbook formulas or when fitting a regression? We can think of this as a lower-bound standard error or, more precisely, as a measure of variation corresponding to a particular model.

Summary

The appropriate standard error depends not just on the data and sampling model but also on the generalization of interest, and the model of variation across units and over time corresponding to the uses to which the estimate will be put. Deciding on a generalization of interest in a sampling or regression problem is similar to the problem of focusing on a particular average treatment effect in causal inference: thinking seriously about your replications (for the goal of getting the right standard error) and inferential goals, you might well get a better understanding of what you’re trying to do with your model.

Studying average associations between income and survey responses on happiness: Be careful about deterministic and causal interpretations that are not supported by these data.

Jonathan Falk writes:

This is an interesting story of heterogeneity of response, and an interesting story of “adversarial collaboration,” and an interesting PNAS piece. I need to read it again later this weekend, though, to see if the stats make sense.

The article in question, by Matthew Killingsworth, Daniel Kahneman, and Barbara Mellers, is called “Income and emotional well-being: A conflict resolved,” and it begins:

Do larger incomes make people happier? Two authors of the present paper have published contradictory answers. Using dichotomous questions about the preceding day, Kahneman and Deaton reported a flattening pattern: happiness increased steadily with log(income) up to a threshold and then plateaued. Using experience sampling with a continuous scale, Killingsworth reported a linear-log pattern in which average happiness rose consistently with log(income). We engaged in an adversarial collaboration to search for a coherent interpretation of both studies. A reanalysis of Killingsworth’s experienced sampling data confirmed the flattening pattern only for the least happy people. Happiness increases steadily with log(income) among happier people, and even accelerates in the happiest group. Complementary nonlinearities contribute to the overall linear-log relationship. . . .

I agree with Falk that the collaboration and evaluation of past published work is great, and I’m happy with the discussion, which is focused so strongly on data and measurement and how they map to conclusions. I don’t know why they call it “adversarial collaboration,” as I don’t see anything adversarial here. That’s a good thing! I’m glad they’re cooperating. Maybe they could just call it “collaboration from multiple perspectives” or something like that.

On the substance, I think the article has three main problems, all of which are exhibited by its very first line:

Do larger incomes make people happier?

Three problems here:

1. Determinism. The question, “Do larger incomes make people happier?”, does not admit variation. Larger incomes are gonna make some people happier in some settings.

2. Causal attribution. If I’m understanding correctly, the data being analyzed are cross-sectional; to put it colloquially, they’re looking at correlation, not causation.

3. Framing in terms of a null hypothesis. Neither of the two articles that motivated this work suggested a zero pattern.

Putting these together, the question, “Do larger incomes make people happier?”, would be more accurately written as, “How much happier are people with high incomes, compared to people with moderate incomes?”

Picky, Picky

You might say that I’m just being picky here; when they ask, “Do larger incomes make people happier?”, everybody knows they’re really talking about averages (not about “people” in general), that they’re talking about association (not about anything “making people happier”), and that they’re doing measurement, not answering a yes-or-no question.

And, sure, I’m a statistician. Being picky is my business. Guilty as charged.

But . . . I think my points 1, 2, 3 are relevant to the underlying questions of interest, and dismissing them as being picky would be a mistake.

Here’s why I say this.

First, the determinism and the null-hypothesis framing lead to a claim about, “Can money buy happiness?” We already know that money can buy some happiness, some of the time. The question, “Are richer people happier, on average?”, is not the same, and I think it’s a mistake to confuse one with the other.

Second, the sloppiness about causality ends up avoiding some important issues. Start with the question, “Do larger incomes make people happier?” There are many ways to have larger incomes, and these can have different effects.

One way to see this is to flip the question around and ask, “Do smaller incomes make people unhappier?” The funny thing is, based on Kahneman’s earlier work on loss aversion, he’d probably say an emphatic Yes to that question. But we can also see that there are different ways to have a smaller income. You might choose to retire—or be forced to do so. You might get fired. Or you might take time off from work to take care of young children. Or maybe you’re just getting pulled by the tides of the national economy. All sorts of possibilities.

A common thread here is that it’s not necessarily the income causing the mood change; it’s that the change in income is happening along with other major events that can affect your mood. Indeed, it’s hard to imagine a big change in income that’s not associated with other big changes in your life.

Again, nothing wrong with looking at average associations of income and survey responses about happiness and life satisfaction. These average associations are interesting in their own right; no need to try to give them causal interpretations that they cannot bear.

Again, I like a lot of the above-linked paper. Within the context of the question, “How much happier are people with high incomes, compared to people with moderate incomes?”, they’re doing a clean, careful analysis, kinda like what my colleagues and I tried to do when reconciling different evaluations of the Millennium Villages Project, or as I tried to do when tracking down an iffy claim in political science. Starting with a discrepancy, getting into the details and figuring out what was going on, then stepping back and considering the larger implications: that’s what it’s all about.

Causal inference and the aggregation of micro effects into macro effects: The effects of wages on employment

James Traina writes:

I’m an economist at the SF Fed. I’m writing to ask for your first thoughts or suggested references on a particular problem that’s pervasive in my field: Aggregation of micro effects into macro effects.

This is an issue that has been studied since the 80s. For example, the individual-level estimates of wages on employment using quasi-experimental tax variation are much smaller than aggregate-level estimates using time series variation. More recently, there has been an active debate on how to port individual-level estimates of government transfers on consumption to macro policy.

Given your expertise, I was wondering if you had insight into how you or other folks in the stats / causal inference field would approach this problem structure more generally.

My reply: Here’s a paper from 2006, Multilevel (hierarchical) modeling: What it can and cannot do. The short answer is that you can estimate micro and macro effects in the same model, but you don’t necessarily have causal identification at both levels. It depends on the design.

You’ll also want a theoretical model. For example, in your model, if you want to talk about “the effects of wages,” it can help to consider potential interventions that could affect local wages: a minimum-wage law, inflation that reduces real (not nominal) wages, national economic conditions that make the labor market more or less competitive, etc. You can also think about potential interventions at an individual level, such as a person getting education or training, marrying or having a child, the person’s employer changing its policies, whatever.

I don’t know enough about your application to give more detail. The point is that “wages” is not in itself a treatment. Wages is a measured variable, and different wage-affecting treatments can have different effects on employment. You can think of these as instruments, even if you’re not actually doing an instrumental variables analysis. Also, treatments that affect individual wages will be different from treatments that affect aggregate wages, so it’s no surprise that they would have different effects on employment. There’s no strong theoretical reason to think that the effects would be the same.
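To illustrate the general point that micro and macro effects need not coincide, here’s a toy simulation, with made-up numbers that have nothing to do with your application, in which within-group and between-group effects of a “wage” variable are estimated in one regression by splitting the predictor into a group mean and a deviation from that mean:

```python
import numpy as np

rng = np.random.default_rng(2)
n_groups, n_per = 100, 50
N = n_groups * n_per

# Toy data: a "wage" that varies both across groups (say, local labor markets)
# and across individuals within a group; the micro and macro effects on the
# outcome are set to be deliberately different.
group = np.repeat(np.arange(n_groups), n_per)
group_wage = rng.normal(10, 2, size=n_groups)
wage = group_wage[group] + rng.normal(0, 1, size=N)

beta_micro, beta_macro = -0.2, 0.5
y = (beta_micro * (wage - group_wage[group])   # within-group (micro) effect
     + beta_macro * group_wage[group]          # between-group (macro) effect
     + rng.normal(0, 1, size=N))

# Decompose the predictor into its group mean and the within-group deviation
# (a within/between, or Mundlak-style, decomposition) and fit both at once.
gmean = np.array([wage[group == g].mean() for g in range(n_groups)])[group]
X = np.column_stack([np.ones(N), wage - gmean, gmean])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print("estimated micro (within-group) effect:", round(coef[1], 2))   # near -0.2
print("estimated macro (between-group) effect:", round(coef[2], 2))  # near 0.5
```

In the simulation the two coefficients are recovered separately, but whether either has a causal interpretation still depends on where the variation comes from, which is the identification question above.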

Finally, I don’t understand how government transfers connect to wages in your problem. Government transfers do not directly affect wages, do they? So I feel like I’m missing some context here.

New research on social media during the 2020 election, and my predictions

Back in 2020, leading academics and researchers at the company now known as Meta put together a large project to study social media and the 2020 US elections — particularly the roles of Instagram and Facebook. As Sinan Aral and I had written about how many paths for understanding effects of social media in elections could require new interventions and/or platform cooperation, this seemed like an important development. Originally the idea was for this work to be published in 2021, but there have been some delays, including simply because some of the data collection was extended as what one might call “election-related events” continued beyond November and into 2021. As of 2pm Eastern today, the news embargo for this work has been lifted on the first group of research papers.

I had heard about this project back a long time ago and, frankly, had largely forgotten about it. But this past Saturday, I was participating in the SSRC Workshop on the Economics of Social Media and one session was dedicated to results-free presentations about this project, including the setup of the institutions involved and the design of the research. The organizers informally polled us with qualitative questions about some of the results. This intrigued me. I had recently reviewed an unrelated paper that included survey data from experts and laypeople about their expectations about the effects estimated in a field experiment, and I thought this data was helpful for contextualizing what “we” learned from that study.

So I thought it might be useful, at least for myself, to spend some time eliciting my own expectations about the quantities I understood would be reported in these papers. I’ve mainly kept up with the academic and grey literature, I’d previously worked in the industry, and I’d reviewed some of this for my Senate testimony back in 2021. Along the way, I tried to articulate where my expectations and remaining uncertainty were coming from. I composed many of my thoughts on my phone Monday while taking the subway to and from the storage unit I was revisiting and then emptying in Brooklyn. I got a few comments from Solomon Messing and Tom Cunningham, and then uploaded my notes to OSF and posted a cheeky tweet.

Since then, starting yesterday, I’ve spoken with journalists and gotten to view the main text of papers for three of the randomized interventions for which I made predictions. These evaluated effects of (a) switching Facebook and Instagram users to a (reverse) chronological feed, (b) removing “reshares” from Facebook users’ feeds, and (c) downranking content by “like-minded” users, Pages, and Groups.

My guesses

My main expectations for those three interventions could be summed up as follows. These interventions, especially chronological ranking, would each reduce engagement with Facebook or Instagram. This makes sense if you think the status quo is somewhat-well optimized for showing engaging and relevant content. So some of the rest of the effects — on, e.g., polarization, news knowledge, and voter turnout — could be partially inferred from that decrease in use. This would point to reductions in news knowledge, issue polarization (or coherence/consistency), and small decreases in turnout, especially for chronological ranking. This is because people get some hard news and political commentary they wouldn’t have otherwise from social media. These reduced-engagement-driven effects should be weakest for the “soft” intervention of downranking some sources, since content predicted to be particularly relevant will still make it into users’ feeds.

Besides just reducing Facebook use (and everything that goes with that), I also expected swapping out feed ranking for reverse chron would expose users to more content from non-friends via, e.g., Groups, including large increases in untrustworthy content that would normally rank poorly. I expected some of the same would happen from removing reshares, which I expected would make up over 20% of views under the status quo, and so would be filled in by more Groups content. For downranking sources with the same estimated ideology, I expected this would reduce exposure to political content, as much of the non-same-ideology posts will be by sources with estimated ideology in the middle of the range, i.e. [0.4, 0.6], which are less likely to be posting politics and hard news. I’ll also note that much of my uncertainty about how chronological ranking would perform was because there were a lot of unknown but important “details” about implementation, such as exactly how much of the ranking system really gets turned off (e.g., how much likely spam/scam content still gets filtered out in an early stage?).

How’d I do?

Here’s a quick summary of my guesses and the results in these three papers:

Table of predictions about effects of feed interventions and the results

It looks like I was wrong in that the reductions in engagement were larger than I predicted: e.g., chronological ranking reduced time spent on Facebook by 21%, rather than the 8% I guessed, which was based on my background knowledge, a leaked report on a Facebook experiment, and this published experiment from Twitter.

Ex post I hypothesize that this is because the duration of these experiments allowed for continual declines in use over months, with various feedback loops (e.g., users with a chronological feed log in less, so they post less, so they get fewer likes and comments, so they log in even less and post even less). As I dig into the 100s of pages of supplementary materials, I’ll be looking to understand what these declines looked like at earlier points in the experiment, such as by election day.

My estimates for the survey-based outcomes of primary interest, such as polarization, were mainly covered by the 95% confidence intervals, with the exception of two outcomes from the “no reshares” intervention.

One thing is that all these papers report weighted estimates for a broader population of US users (population average treatment effects, PATEs), which are less precise than the unweighted (sample average treatment effect, SATE) results. Here I focus mainly on the unweighted results, as I did not know there was going to be any weighting and these are also the more narrow, and thus riskier, CIs for me. (There seems to have been some mismatch between the outcomes listed in the talk I saw and what’s in the papers, so I didn’t make predictions for some reported primary outcomes and some outcomes I made predictions for don’t seem to be reported, or I haven’t found them in the supplements yet.)

Now is a good time to note that I basically predicted what psychologists armed with Jacob Cohen’s rules of thumb might, extrapolating downward, call “minuscule” effect sizes. All my predictions for survey-based outcomes were 0.02 standard deviations or smaller. (Recall that Cohen’s rules of thumb say 0.2 is small, 0.5 medium, and 0.8 large.)

Nearly all the results for these outcomes in these papers were indistinguishable from the null (p > 0.05), with standard errors for survey outcomes at 0.01 SDs or more. This is consistent with my ex ante expectations that the experiments would face severe power problems, at least for the kind of effects I would expect. Perhaps by revealed preference, a number of other experts had different priors.
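To give a sense of the power issue, here’s a back-of-the-envelope calculation assuming a simple two-sided z-test and the standard errors just mentioned:

```python
from scipy.stats import norm

# Approximate power of a two-sided z-test at alpha = 0.05, for true effects
# of 0.02 and 0.01 SDs when the standard error is about 0.01 SDs.
alpha, se = 0.05, 0.01
z_crit = norm.ppf(1 - alpha / 2)   # about 1.96

for effect in (0.02, 0.01):
    z = effect / se
    power = norm.cdf(z - z_crit) + norm.cdf(-z - z_crit)
    print(f"true effect {effect:.2f} SD -> power {power:.2f}")
# Roughly 0.52 for a 0.02 SD effect and 0.17 for a 0.01 SD effect:
# not much chance of clearly detecting effects of the size I expected.
```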

A rare p < 0.05 result is that chronological ranking reduced news knowledge by 0.035 SDs with 95% CI [-0.061, -0.008], which includes my guess of -0.02 SDs. Removing reshares may have reduced news knowledge even more than chronological ranking — and by more than I guessed.

Even with so many null results I was still sticking my neck out a bit compared with just guessing zero everywhere, since in some cases if I had put the opposite sign my estimate wouldn’t have been in the 95% CI. For example, downranking “like-minded” sources produced a CI of [-0.031, 0.013] SDs, which includes my guess of -0.02, but not its negation. On the other hand, I got some of these wrong: I guessed removing reshares would reduce affective polarization by 0.02 SDs, but that guess of -0.02 is outside the resulting [-0.005, +0.030] interval.

It was actually quite a bit of work to compare my predictions to the results because I didn’t really know a lot of key details about exact analyses and reporting choices, which strikingly even differ a bit across these three papers. So I might yet find more places where I can, with a lot of reading and a bit of arithmetic, figure out where else I may have been wrong. (Feel free to point these out.)

Further reflections

I hope that this helps to contextualize the present results with expert consensus — or at least my idiosyncratic expectations. I’ll likely write a bit more about these new papers and further work released as part of this project.

It was probably an oversight for me not to make any predictions about the observational paper looking at polarization in exposure and consumption of news media. I felt like I had a better handle on thinking about simple treatment effects than these measures, but perhaps that was all the more reason to make predictions. Furthermore, given the limited precision of the experiments’ estimates, perhaps it would have been more informative (and riskier) to make point predictions about these precisely estimated observational quantities.

[This post is by Dean Eckles. I want to note that I was an employee or contractor of Facebook (now Meta) from 2010 through 2017. I have received funding for other research from Meta, Meta has sponsored a conference I organize, and I have coauthored with Meta employees as recently as earlier this month. I was also recently a consultant to Twitter, ending shortly after the Musk acquisition. You can find all my disclosures here.]

Problem with the University of Wisconsin’s Area Deprivation Index. And, no, face validity is not “the weakest of all possible arguments.”

A correspondent writes:

I thought you might care to comment on a rebuttal in today’s HealthAffairs. I find it a poor non-defense that relies on “1000s of studies used our measure and found it valid”, as well as attacks on the critics of their work.

The issue began when the Centers for Medicare & Medicaid Services (CMS) decided to explore a health equity payment model called ACO-REACH. CMS chose a revenue-neutral scheme to remove some dollars from payments to providers serving the most-advantaged people and re-allocate those dollars to the most disadvantaged. Of course, CMS needs to choose a measure of poverty that is 100% available and easy to compute. These requirements limit the measure to a poverty index available from Census data.

CMS chose to use a common poverty index, the University of Wisconsin’s Area Deprivation Index (ADI). Things got spicy earlier this year when some other researchers noticed that no areas in the Bronx or southeastern DC are in the most-disadvantaged deciles of the ADI measure. After digging into the ADI methods a bit deeper, it seems the issue is that the ADI does not scale the housing dollars appropriately before using that component in a principal components analysis to create the poverty index.

One thing I find perplexing about the rebuttal from UWisc is that it completely ignores the existence of every other validated poverty measure, and specifically the CDC’s Social Vulnerability Index. Their rebuttal pretends that there is no alternative solution available, and therefore the ADI measure must be used as is. Lastly, while ADI is publicly available, it is available under a non-commercial license so it’s a bit misleading for the authors to not disclose that they too have a financial interest in pushing the ADI measure while accusing their critics of financial incentives for their criticism.

The opinions expressed here are my own and do not reflect those of my employer or anyone else. I would prefer to remain anonymous if you decide to report this to your blog, as I wish to not tie these personal views to my employer.

Interesting. I’d never heard of any of this.

Here’s the background:

Living in a disadvantaged neighborhood has been linked to a number of healthcare outcomes, including higher rates of diabetes and cardiovascular disease, increased utilization of health services, and earlier death. Health interventions and policies that don’t account for neighborhood disadvantage may be ineffective. . . .

The Area Deprivation Index (ADI) . . . allows for rankings of neighborhoods by socioeconomic disadvantage in a region of interest (e.g., at the state or national level). It includes factors for the theoretical domains of income, education, employment, and housing quality. It can be used to inform health delivery and policy, especially for the most disadvantaged neighborhood groups. “Neighborhood” is defined as a Census block group. . . .

The rebuttal

Clicking on the above links, I agree with my correspondent that there’s something weird about the rebuttal article, starting with its title, “The Area Deprivation Index Is The Most Scientifically Validated Social Exposome Tool Available For Policies Advancing Health Equity,” which elicits memories of Cold-War-era Pravda, or perhaps an Onion article parodying the idea of someone protesting too much.

The article continues with some fun buzzwords:

This year, the Center for Medicare and Medicaid Innovation (CMMI) took a ground-breaking step, creating policy aligning with multi-level equity science and targeting resources based on both individual-level and exposome (neighborhood-level) disadvantage in a cost-neutral way.

This sort of bureaucratic language should not in itself be taken to imply that there’s anything wrong with the Area Deprivation Index. A successful tool in this space will get used by all sorts of agencies, and bureaucracy will unavoidably spring up around it.

Let’s read further and see how they respond to the criticism. Here they go:

Hospitals located in high ADI neighborhoods tend to be hit hardest financially, suggesting health equity aligned policies may offer them a lifeline. Yet recently, CMS has been criticized for selecting ADI for use in its HEBA. According to behavioral economics theory, potential losers will always fight harder than potential winners, and in a budget-neutral innovation like ACO REACH there are some of both.

I’m not sure the behavioral economics framing makes sense here. Different measures of deprivation will correspond to different hospitals getting extra funds, so in that sense both sides in the debate represent potential winners and losers from different policies.

They continue:

CMS must be allowed time to evaluate the program to determine what refinements to its methodology, if any, are needed. CMS has signaled openness to fine-tune the HEBA if needed in the future. Ultimately, CMS is correct to act now with the tools of today to advance health equity.

Sure, but then you could use one of the other available indexes, such as the Social Deprivation Index or the Social Vulnerability Index, right? It seems there are three questions here: first, whether to institute this new policy to “incentivize medical groups to work with low-income populations”; second, whether there are any available measures of deprivation that make sense for this purpose; third, if more than one measure is available, which one to use.

So now on to their defense of the Area Deprivation Index:

The NIH-funded, publicly available ADI is an extensively validated neighborhood-level (exposome) measure that is tightly linked to health outcomes in nearly 1000 peer-reviewed, independent scientific publications; is the most commonly used social exposome measure within NIH-funded research today; and undergoes a rigorous, multidisciplinary evaluation process each year prior to its annual update release. Residing in high ADI neighborhoods is tied to biological processes such as accelerated epigenetic aging, increased disease prevalence and increased mortality, poor healthcare quality and outcomes, and many other health factors in research studies that span the full US.

OK, so ADI is nationally correlated with various bad outcomes. This doesn’t yet address the concern of the measure having problems locally.

But they do get into the details:

A recent peer-reviewed article argued that the monetary values in the ADI should be re-weighted and an accompanying editorial noted that, because these were “variables that were measured in dollars,” they made portions of New York State appear less disadvantaged than the authors argued they should be. Yet New York State in general is a very well-resourced state with one of the ten highest per capita incomes in the country, reflected in their Medicaid Federal Medical Assistance Percentage (FMAP). . . .

Some critics relying on face validity claim the ADI does not perform “well” in cities with high housing costs like New York, and also California and Washington, DC, and suggest that a re-weighted new version be created, again ignoring evidence demonstrating the strong link between the ADI and health in all kinds of cities including New York (also here), San Francisco, Houston, San Antonio, Chicago, Detroit, Atlanta, and many others. . . .

That first paragraph doesn’t really address the question, as the concerns about the South Bronx not having a high deprivation index are about one part of New York, not “New York State in general.” But the rebuttal article does offer two links about New York specifically, so let me take a look:

Associations between Amygdala-Prefrontal Functional Connectivity and Age Depend on Neighborhood Socioeconomic Status:

Given the bimodal distribution of ADI percentiles in the current sample, the variable was analyzed in three groups: low (90–100), middle (11–89), and high neighborhood SES.

To get a sense of things, I went to the online Neighborhood Atlas and grabbed the map of national percentiles for New York State:

So what they’re doing is comparing some rich areas of NYC and its suburbs to some low- and middle-income parts of the city, suburbs, and upstate, and to some low-income rural and inner-city areas upstate.

Association Between Residential Neighborhood Social Conditions and Health Care Utilization and Costs:

Retrospective cohort study. Medicare claims data from 2013 to 2014 linked with neighborhood social conditions at the US census block group level of 2013 for 93,429 Medicare fee-for-service and dually eligible patients. . . . Disadvantaged neighborhood conditions are associated with lower total annual Medicare costs but higher potentially preventable costs after controlling for demographic, medical, and other patient characteristics. . . . We restricted our sample to patients with 9-digit residential zip codes available in New York or New Jersey . . .

I don’t see the relevance of these correlations to the criticisms of the ADI.

To return to our main thread, the rebuttal summarizes:

The ADI is currently the most validated scientific tool for US neighborhood level disadvantage. This does not mean that other measures may not eventually also meet this high bar.

My problem here is with the term “most validated.” I’m not sure how to take this, given that all this validation didn’t seem to have caught the problem with the South Bronx etc. But, sure, I get their general point: when doing research, better to go with the devil you know, etc.

The rebuttal authors add:

CMS should continue to investigate all options, beware of conflicts of interest, and maintain the practice of vetting scientific validated, evidence-based criteria when selecting a tool to be used in a federal program.

I think we can all agree on that.

Beyond general defenses of the ADI on the grounds that many people use it, the rebuttal authors make an interesting point about the use of neighborhood-level measures more generally:

Neighborhood-level socioeconomic disadvantage is just as (and is sometimes more) important than individual SES. . . . These factors do not always overlap, one may be high, the other low or vice versa. Both are critically important in equity-focused intervention and policy design. In their HEBA, as aligned with scientific practice, CMS has included one of each—the ADI captures neighborhood-level factors, and dual Medicare and Medicaid eligibility represents an individual-level factor. Yet groups have mistakenly conflated individual-level and neighborhood-level factors, wrongly suggesting that neighborhood-level factors are only used because additional individual factors are not readily available.

They link to a review article. I didn’t see the reference there to groups claiming that neighborhood-level factors are only used because additional individual factors are not readily available, but I only looked at that linked article quickly so I probably missed the relevant citation.

The above are all general points about the importance of using some neighborhood-level measure of disadvantage.

But what about the specific concerns raised with the ADI, such as the labeling of most of the South Bronx as being low disadvantage (in the 10th to 30th percentile nationally)? Here’s what I could find in the rebuttal:

These assertions rely on what’s been described as “the weakest of all possible arguments”: face validity—defined as the appearance of whether or not something is a correct measurement. This is in contrast to empirically-driven tests for construct validity. Validation experts universally discredit face validity arguments, classifying them as not legitimate, and more aligned with “marketing to a constituency or the politics of assessment than with rigorous scientific validity evidence.” Face validity arguments on their own are simply not sufficient in any rigorous scientific argument and are fraught with potential for bias and conflict of interest. . . .

Re-weighting recommendations run the risk of undermining the strength and scientific rigor of the ADI, as any altered ADI version no longer aligns with the highly-validated original Neighborhood Atlas ADI methodology . . .

Some have suggested that neighborhood-level disadvantage metrics be adjusted to specific needs and areas. We consider this type of change—re-ranking ADI into smaller, custom geographies or adding local adjustments to the ADI itself—to be a type of gerrymandering. . . . A decision to customize the HEBA formula in certain geographies or parts of certain types of locations will benefit some areas and disservice others . . .

I disagree with the claim that face validity is “the weakest of all possible arguments.” For example, saying that a method is good because it’s been cited thousands of times, or saying that local estimates are fine because the national or state-level correlations look right, those are weaker arguments! And if validation experts universally discredit face validity arguments . . . ummmm, I’m not sure who are the validation experts out there, and in any case I’d like to see the evidence of this purportedly universal view. Do validation experts universally think that North Korea has moderate electoral integrity?

The criticism

Here’s what the critical article lists as limitations of the ADI:

Using national ADI benchmarks may mask disparities and may not effectively capture the need that exists in some of the higher cost-of-living geographic areas across the country. The ADI is a relative measure for which included variables are: median family income; percent below the federal poverty level (not adjusted geographically); median home value; median gross rent; and median monthly mortgage. In some geographies, the ADI serves as a reasonable proxy for identifying communities with poorer health outcomes. For example, many rural communities and lower-cost urban areas with low life expectancy are also identified as disadvantaged on the national ADI scale. However, for parts of the country that have high property values and high cost of living, using national ADI benchmarks may mask the inequities and poor health outcomes that exist in these communities. . . .

They recommend “adjusting the ADI for variations in cost of living,” “recalibrating the ADI to a more local level,” or “making use of an absolute measure such as life expectancy rather than a relative measure such as the ADI.”

There seem to be two different things going on here. The first is that ADI is a socioeconomic measure, and it could also make sense to include a measure of health outcomes. The second is that, as a socioeconomic measure, ADI seems to have difficulty in areas that are low income but with high housing costs.
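My correspondent’s diagnosis is that the housing-dollar variables go into a principal components analysis without appropriate scaling. Whether or not that is exactly what happens inside the ADI, here’s a toy illustration, with made-up block-group numbers, of the mechanism by which an unstandardized dollar-denominated variable can dominate a PCA-based index:

```python
import numpy as np

# Made-up block-group data: a poverty rate (a proportion) and a median rent
# (in dollars). The rent column is on a hugely larger numeric scale.
poverty = np.array([0.35, 0.30, 0.05, 0.08, 0.20])
rent = np.array([1800.0, 1700.0, 1200.0, 2500.0, 900.0])
X = np.column_stack([poverty, rent])

def first_pc_loadings(M):
    """Loadings of the first principal component of the columns of M."""
    cov = np.cov(M, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, -1]          # eigenvector for the largest eigenvalue

print("loadings, raw variables:         ", np.round(first_pc_loadings(X), 3))
# Essentially (0, +/-1): the "index" is just rent, because the variance of
# rent in dollars swamps the variance of the poverty rate.

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print("loadings, standardized variables:", np.round(first_pc_loadings(Z), 3))
# Now both variables contribute with comparable magnitude.
```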

My summary

1. I agree with my correspondent’s email that led off this post. The criticisms of the ADI seem legit—indeed, they remind me a bit of the Human Development Index, which had a similar problem of giving unreasonable summaries, attributable to someone constructing a reasonable-seeming index and then not looking into the details; see here for more. There was also the horrible, horrible Electoral Integrity Index, which had similar issues of face validity that could be traced back to fundamental problems of measurement.

2. I also agree with my correspondent that the rebuttal article is bad for several reasons. The rebuttal:
– does not ever address the substantive objections;
– doesn’t seem to recognize that, just because a measure gives reasonable national correlations, that doesn’t mean that it can’t have serious local problems;
– leans on an argument-from-the-literature that I don’t buy, in part out of general distrust of the literature and in part because none of the cited literature appears to address the concerns on the table;
– presents a ridiculous argument against the concept of face validity.

Face validity—what does that mean?

Let me elaborate upon that last point. When a method produces a result that seems “on its face” to be wrong, that does not necessarily tell us that the method is flawed. If something contradicts face validity, that tells us that it contradicts our expectations. It’s a surprise. One possibility is that our expectations were wrong! Another possibility is that there is a problem with the measure, in which case the contradiction with our expectations can help us understand what went wrong. That’s how things went with the political science survey that claimed that North Korea was a moderately democratic country, and that’s how things seem to be going with the Area Deprivation Index. Even if it has thousands of citations, it can still have flaws. And in this case, the critics seem to have gone in and found where some of the flaws are.

In this particular example, the authors of the rebuttal have a few options.

They could accept the criticisms of their method and try to do better.

Or they could make the affirmative case that all these parts of the South Bronx, southeast D.C., etc., are not actually socioeconomically deprived. Instead they kind of question that these areas are deprived (“New York State in general is a very well-resourced state”) without quite making that claim. I think one reason they’re stuck in the middle is politics. Public health in general comes from the left side of the political spectrum, and, from the left, if an area is poor and has low life expectancy, you’d call it deprived. From the right, you could argue that these poor areas already get tons of government support, that all this welfare dependence just compounds the problem, and that such neighborhoods are not so much “deprived” as oversaturated with government assistance. But I don’t think we’d be seeing much of that argument in the health-disparities space.

Or they could make a content-low response without addressing the problem. Unfortunately, that’s the option they chose.

I don’t think they deliberately chose to respond poorly here. My guess is that they’re soooo comfortable with their measure, soooooo sure it’s right, that they just dismissed the criticism without ever thinking about it. Which is too bad. But now they have this post! Not too late for them to do better. Tomorrow’s another day, hey!

P.S. My correspondent adds:

The original article criticizing the ADI measure has some map graphic sins that any editor should have removed before publication. Here are some cleaner comparisons of the city data. The SDI measure in those plots is the Social Deprivation Index from Robert Graham Center.

[Maps comparing the ADI and SDI for Washington, D.C., New York City, Boston, and the San Francisco area.]

The vicious circle of corroboration or pseudo-confirmation in science and engineering

Allan Cousins writes:

I have recently been thinking about the way in which professionals come to accumulate “knowledge” over their careers and how that process utilizes (read: abuses) the notion of corroboration. I believe this might be of interest to both of you and so I wanted to see if either of you might have any insights or comments.

In particular, I have been thinking about professional endeavours that have dichotomous outcomes where the range of possibilities is restricted to (or perhaps more accurately, viewed as) it either worked or it did not work. For the purposes of this discussion I will look at structural engineering but I believe the phenomenon I am about to describe is just as applicable to other similarly characterized disciplines. In structural engineering: the structure either stood up or it collapsed, the beam either carried the load or it did not, etc. In my experience there are nearly as many theories of how structures work as there are structural engineers. But this wide range of opinions among structural engineers is certainly not because the underlying concepts are not well understood. That may have been true in 1850 but not today. In fact, structural engineering is quite mature as a field and there are very few concepts (except at the edges of the field) where such a diverse range of thought could be justified.

This raises the question of how this unsatisfactory state of affairs could have come to pass. I have often pondered this but only recently have come to what I think is a reasonable explanation. First, let us rule out the idea that structural engineering professionals are of below average intelligence (or rather below some required intelligence threshold for such endeavors, known only to Omniscient Jones). Under such an assumption I believe that the likely answer to our question comes down to an interplay between industry dynamics, an abuse of the concept of corroboration, and the nature of the outcomes inherent to the field.

Even if engineers have never heard of the concept of Philosophy of Science (and most have not) they are apt to act in ways akin to the typical scientist. That is, they go about their enterprise (designing structures) by continuously evaluating their understanding of the underlying structural mechanics by looking at and seeking out corroborating evidence. However, unlike scientists, structural engineers don’t usually have the ability to conduct risky tests (in the Popperian sense) in their day to day designs. By definition the predicted outcome of a risky test is likely to be wrong in the absence of the posited theory, and if structural engineers were routinely conducting such field tests, newspaper headlines would be replete with structural engineering failures. But today structural engineering failures are quite rare and when they happen they are usually small in magnitude (one of the greatest structural engineering failures in US history was the Hyatt Regency Walkway collapse and it only caused 114 deaths. For comparison that is about the same number of deaths caused by road accidents in a single DAY in the US). Indeed, building codes and governing standards are codified in such a way that failure of any given element in a system is quite a rare event (global failure even rarer still). What that means is that even if what a structural engineer believes to be true about the structural systems that they design actually has very little verisimilitude (read: is mostly wrong and to a severe degree), their designs will not fail in practice as long as they follow codified guidelines. It is only when structural engineers move away from the typical (where standard details are the norm and codes contain prescribed modes of analysis / design) that gaps in their understanding become apparent due to observed failures. What this means then is that while the successful outcome of each “test” (each new structural design) is likely to be taken by the designer as corroborating their understanding (in the same sense that it does for the scientist), it does not necessarily even provide the most meager of evidence that the designer has a good grasp of their discipline. In fact, it is possible (though admittedly not overly likely) that a designer has everything backwards and yet their designs don’t fail because of the prescribed nature of governing codes.

The above leaves us with an interesting predicament. It seems clear that structural engineers or others in similarly situated disciplines cannot rely on outcomes to substantiate their understanding. Though in practice that is what they largely do; they are human after all.

This lack of ability to conduct risky tests interplays with industry dynamics, and not in a particularly promising way. Those who commission structural designs are unlikely to care about the design itself (except to the extent that it doesn’t fail and doesn’t mess with the intended aesthetic), and as a result, structural engineering tends to be treated like a commodity product where the governing force is price. What that means is that there is an overwhelming pressure to get designs out the door as quickly as possible lest a structural engineering firm lose money on its bid. This pressure all but guarantees that even if senior structural engineers have a good understanding of structural principles, the demands for their time leave few hours in the day to be spent on mentorship and review of young engineers’ work product. As a result, young engineers are unlikely to be able to rely on senior engineers to correct their misunderstanding of structural principles. That pretty much leaves only one other avenue for the young engineer to gain true understanding and that is via self-teaching of the literature and the like. However, given the lack of ability to construct risky tests (see above) the self-learning route is apt to lead young structural engineers to think that they have a good understanding of certain concepts (because they see corroborating evidence in their “successful” designs) where that is not the case. Though to be fair to my brethren I am assuming that the average young engineer does not have the ability to discern true engineering principles from the literature on their own without aid. However, I believe this assumption to hold, on average.

This leads to a cycle where young engineers – who have a less than perfect understanding of structural systems that goes unchecked – become senior engineers who in turn are looked up to by a new crop of young engineers. The now senior engineers mentor the young engineers, to the extent time demands allow, and distill their misknowledge to them. Those young engineers eventually become senior. And in the extreme, the cycle repeats progressively until “knowledge” at the most senior levels of the field is almost devoid of any verisimilitude at all. Naturally there will be counterbalancing forces where some verisimilitude is maintained but I do think the cycle, as I have described it, is at least a decent caricature of how things unfold in practice. It’s worth remarking that many on the outside will never see this invisible cycle because it is shielded from them by the fact that structures tend to stand up!

It seems to me that this unfortunate dynamic is likely to play out in any discipline where outcomes are dichotomous in nature and where the unwanted outcome (such as structural failure) is a low probability event by construction (and is unconnected to true understanding of the underlying concepts). It is certainly interesting to think about, and when the above phenomenon is coupled with human tendency to ascribe good outcomes to skill, and poor outcomes to bad luck, the result in terms of knowledge accumulation / dissemination may be quite unsatisfactory.

I think what I have just argued is that professional activities that become commoditized are likely to be degenerative over time. This would certainly accord with my experience in structural engineering and other fields where I have some substantive knowledge. And I wanted to see if you would agree or not. Do you have any stark counter examples from your professional life that you can recall? Do you think I am being unduly pessimistic?

There are two things going on here:

1. Corroboration, and the expectation of corroboration, as a problem. This relates to what I’ve called the confirmationist paradigm of science, where the point of experimentation is to confirm theories. The motivations are then all in the wrong places. Quantitative analysis under uncertainty (i.e., statistics) adds another twist to the vicious cycle of confirmation, with the statistical significance filter and the 80% power lie, by which effects get overestimated, motivating future studies that overestimate effect sizes, etc., until entire subfields get infested with wild and unrealistic overestimates. (A quick simulation sketch of this point appears after this list.)

2. The sociological angle, with students following their advisors, advisors promoting former students, etc. I don’t have so much to say about this one, but I guess that it’s part of the story too.
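To make the significance-filter point in item 1 concrete, here is a minimal simulation sketch. The true effect, standard error, and threshold are assumed numbers chosen for illustration, not taken from any particular study; the idea is just that, in a low-power setting, the estimates that happen to clear the significance threshold are on average much larger than the true effect.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed numbers for illustration only: true effect 0.1, standard error 0.1,
# i.e., a badly underpowered study regardless of what the grant proposal claimed.
true_effect, se, n_sims = 0.1, 0.1, 100_000

estimates = rng.normal(true_effect, se, n_sims)   # unbiased estimates of the true effect
significant = np.abs(estimates / se) > 1.96       # the statistical significance filter

print("mean of all estimates:          ", round(estimates.mean(), 3))               # about 0.10
print("mean of 'significant' estimates:", round(estimates[significant].mean(), 3))  # about 0.24
print("share reaching significance:    ", round(significant.mean(), 3))             # about 0.17, not 0.80
```

Conditioning on statistical significance turns an unbiased estimate into a badly exaggerated one, and the lower the power, the worse the exaggeration, which is how a literature built on such filtered estimates drifts toward wild overestimates.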

Also relevant to this discussion is the recent book, False Feedback in Economics: The Case for Replication, by Andrin Spescha.

Cheating in science, sports, journalism, business, and art: How do they differ?

I just read “Lying for Money: How Legendary Frauds Reveal the Workings of Our World,” by Dan Davies.

I think the author is the same Dan Davies who came up with the saying, “Good ideas do not need lots of lies told about them in order to gain public acceptance,” and also the “dsquared” who has occasionally commented on this blog, so it is appropriate that I heard about his book in a blog comment from historian Sean Manning.

As the title of this post indicates, I’m mostly going to be talking here about the differences between frauds in three notoriously fraud-infested but very different fields of human endeavor: science, sports, and business.

But first I wanted to say that this book by Davies is one of the best things about economics I’ve ever read. I was trying to think what made it work so well, and I realized that the problem with most books about economics is that they’re advertising the concept of economics, or they’re fighting against dominant economics paradigms . . . One way or another, those books are about economics. Davies’s book is different in that he’s not saying that economics is great, he’s not defensive about economics, and he’s not attacking it either. His book is not about economics; it’s about fraud, and he’s using economics as one of many tools to help understand fraud. And then when he gets to Chapter 7 (“The Economics of Fraud”), he’s well situated to give the cleanest description I’ve ever seen of economics, integrating micro to macro in just a few pages. I guess a lot of readers and reviewers will have missed that bit because it’s not as lively as the stories at the front of the book (also, who ever gets to Chapter 7, right?), and that’s kinda too bad. Maybe Davies could follow up with a short book, “Economics, what’s it all about?” Probably not, though, as there are already a zillion other books of this sort, and there’s only one “Lying for Money.” I’m sure there are lots of academic economists and economics journalists who understand the subject as well as or better than Davies; he just has a uniquely (as far as I’ve seen) clear perspective, neither defensive nor oppositional but focused on what’s happening in the world rather than on academic or political battles for the soul of the field. (See here and here for further discussion of this point.)

Cheating in business

Cheating in business is what “Lying for Money” is all about. Davies mixes stories of colorful fraudsters with careful explanations of how the frauds actually worked, along with some light systematizing of different categories of financial crime.

In his book, Davies does a good job of not blaming the victims. He does not push the simplistic line that “you can’t cheat an honest man.” As he points out, fraud is easier to commit in an environment of widespread trust, and trust is in general a good thing in life, both because it is more pleasant to think well of others and also because it reduces transaction costs of all sorts.

Linear frauds and exponential frauds

Beyond this, one of the key points of the book is that there are two sorts of frauds, which I will call linear and exponential.

In a linear fraud, the fraudster draws money out of the common reservoir at a roughly constant rate. Examples of linear frauds include overbilling of all sorts (medical fees, overtime payments, ghost jobs, double charging, etc.), along with the flip side of this, which is not paying for things (tax dodging, toxic waste dumping, etc.). A linear fraud can go on indefinitely, until you get caught.

In an exponential fraud, the fraudster needs to keep stealing more and more to stay solvent. Examples of exponential frauds include pyramid schemes (of course), mining fraud, stock market manipulations, and investment scams of all sorts. A familiar example is Bernie Madoff, who raised zillions from people by promising them unrealistic returns on their money, but as a result incurred many more zillions of financial obligations. The scam was inherently unsustainable. Similarly with Theranos: the more money they raised from their investors, the more trouble they were in, given that they didn’t actually ever have a product. With an exponential fraud you need to continue expanding your circle of suckers—once that stops, you’re done.

A linear fraud is more sustainable—I guess the most extreme example might be Mister 880, the counterfeiter of one-dollar bills who was featured in a New Yorker article many years ago—but exponential frauds can grow your money faster. Embezzling can go either way: in theory you can sustainably siphon off a little bit every month without creating noticeable problems, but in practice embezzlers often seem to take more money than is actually there, giving them unending future obligations to replace the missing funds.
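As a toy illustration of the difference (my numbers, not Davies’s): a linear fraudster who skims a fixed amount each month accumulates losses in a straight line, while a Ponzi-style operator who promises a steady return on money that earns nothing watches the obligations compound.

```python
# Toy sketch with made-up amounts and rates; nothing here is from the book.
skim_per_month = 10_000      # linear fraud: a fixed amount stolen each month
raised = 1_000_000           # exponential fraud: money taken in from investors
promised_return = 0.05       # monthly return promised to those investors
actual_return = 0.0          # what the money actually earns

linear_total, obligation = 0, raised
for month in range(36):
    linear_total += skim_per_month
    obligation *= 1 + promised_return - actual_return   # what the operator now owes

print(f"linear fraud, total skimmed after 36 months: {linear_total:>12,.0f}")
print(f"Ponzi obligations after 36 months:           {obligation:>12,.0f}")
```

The linear total grows by the same amount every month, while the Ponzi obligation roughly sextuples over the three years, which is why the exponential fraudster needs an ever-expanding circle of new suckers just to stand still.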

With any exponential fraud, the challenge is to come up with an exit strategy. Back in the day, you could start a pyramid scheme or other such fraud, wait until a point where the scam had gone long enough that you had a good profit but before you reach the sucker event horizon, and then skip town. The only trick is to remember to jump off the horse before it collapses. For business frauds, though, there’s a paper trail, so it’s harder to leave without getting caught. The way Davies puts it is that in your life you have one chance to burn your reputation in this way.

Another way for a fraudster to escape, financially speaking, is to go legit. If you’re a crooked investor, you can take your paper fortune to the racetrack or the stock market and make some risky bets: if you win big, you can pay off your funders and retire. Unfortunately, if you win big, and you’re already the kind of person to conduct an exponential fraud in the first place, it seems likely you’ll just take this as a sign that you should push further. Sometimes, though, you can keep things going indefinitely by converting an exponential into a linear scheme, as seems to have happened with some multilevel marketing operations. As Davies says, if you can get onto a stable financial footing, you have something that could be argued was never a fraud at all, just a successful business that makes its money by convincing people to pay more for your product than it’s worth.

The final exit strategy is recidivism, or perhaps rehabilitation. Davies shares many stories of fraudsters who got caught, went to prison, then popped out and committed similar crimes again. They kept doing what they were
good at! Every once in awhile you see a fraudster who managed to grease enough palms that after getting caught he could return to life as a rich person, for example Michael Milken.

One other thing. Yes, exponential frauds are especially unsustainable, but linear frauds can be tricky to maintain too. Even if you’re cheating people at a steady, constant rate, so you have no pressing need to raise funds to cover your past losses, you’re still leaving a trail of victims behind, and any one of them can decide to be the one to put in the effort to stop you. More victims = greater odds of being tracked down. There’s all sorts of mystique about “cooling off the mark,” but my impression is that the main way that scammers get away with their frauds is by maintaining some physical distance from the people they’ve scammed, and by taking advantage of the legal system to make life difficult for any whistleblowers or victims who come after them. Again, see Theranos.

Cheating in science

Science fraud is a mix of linear and exponential. The linear nature of the fraud is that it’s typically a little bit in paper after paper, grant proposal after grant proposal, Ted talk after Ted talk, a lie here, an exaggeration there, some data manipulation, some p-hacking, each time doing whatever it takes to get the job done. The fraud is linear in that there’s no compounding; it’s not like each new research project requires an ever-larger supply of fake data to make up for what was taken last time.

On the other hand, there’s a potentially exponential problem that, if you use fraud to produce an important “discovery,” others will want to replicate it for themselves, and when those replications fail, you’ll need to put in even more effort to prop up your original claims. In business, this propping-up can take different forms (new supplies of funds, public relations, threats, delays, etc.), and similarly there are different ways in science to prop up fake claims: you can ignore the failed replications and hope for the best, you can attack the replicators, you can use connections in the news media to promote your view and use connections in academia to publish purported replications of your own, you can jump sideways into a new line of research and cheat to produce success there . . . lots of options. The point is, fake scientific success is hydra-headed as it will spawn continuing waves of replication challenges. As with financial fraud, the challenge, after manufacturing a scientific success, is to draw a line under it, to get it accepted as canon, something they can never take away from you.

Cheating in sports

Lance Armstrong is an example of an exponential fraud. He doped to win bike races—apparently everybody was doping at the time. But Lance was really really good at doping. People started to talk, and then Lance had to do more and more to cover it up. He engaged in massive public relations, he threatened people, he tried to wait it out . . . nothing worked. Dude is permanently disgraced. It seems that he’s still rich, though: according to wikipedia, “Armstrong owns homes in Austin, Texas, and Aspen, Colorado, as well as a ranch in the Texas Hill Country.”

Other cases of sports cheating have more of a linear nature. Maradona didn’t have to keep punching balls into the net; once was enough, and he still got to keep his World Cup victory. If Brady Anderson doped, he just did it and that was that; no escalating behavior was necessary.

Cheating in journalism

Journalists cheat by making things up in the fashion of Mike Barnicle or Jonah Lehrer, or by reporting stories that originally appeared elsewhere without crediting the original source, which I’ve been told is standard practice at the New York Times and other media outlets. Reporting an already-told story without linking to the source is considered uncool in the blogging world but is so common in regular journalism that it’s not even considered cheating! Fabrication, though, remains a bridge too far.

Overall I’d say that cheating in journalism is like cheating in science and sports in largely being linear. Every instance of cheating leaves a hostage to fortune, so as you continue to cheat in your career, it seems likely you’ll eventually get found out for something or another, but there’s no need for an exponential increase in the amount of cheating in the way that business cheaters need to recoup larger and larger losses.

The other similarity of cheating in journalism to cheating in other fields is the continuing need for an exit strategy, with the general idea being to build up reputational credit during the fraud phase that you can then cash in during the discovery phase. That is, once enough people twig to your fraud, you are already considered too respectable or valuable to dispose of. Mike Barnicle is still on TV! Malcolm Gladwell is still in the New Yorker! (OK, Gladwell isn’t doing fraud, exactly: rather than knowingly publishing lies, he’s conveniently putting himself in the position where he can publish untrue and misleading statements while placing himself behind some sort of veil of ignorance where he can’t be held personally to blame for these statements. He’s playing the role of a public relations officer who knows better than to check the veracity of the material he’s being asked to promote.)

Art fraud

I don’t have anything really to say about cheating in art, except that it’s a fascinating topic and much has been written about it. Art forgery involves some amusing theoretical questions, such as: if someone copies a painting or a style of a no-longer-living artist so effectively that nobody can tell the difference, is anyone harmed, other than the owners of existing work whose value is now diluted? From a business standpoint, though, art forgery seems similar to other forgery in being an essentially linear fraud, again leading to a linearly increasing set of potentially incriminating clues.

Closely related to art fraud is document fraud, for example the hilarious and horrifying (but more hilarious than horrifying) gospel of Jesus’s wife fraud, and this blurs into business fraud (the documents are being sold) and science fraud (in this case, bogus claims about history).

Similarities between cheating in business, science, sports, and journalism

Competition is a motivation for cheating. It’s hard to compete in business, science, sports, and journalism. Lots of people want to be successes and there aren’t enough slots for everyone. So if you don’t have the resources or talent or luck to succeed legitimately, cheating is an alternative path. Or if you are well situated for legitimate success, cheating can take you to the next level (I’m looking at you, Barry Bonds).

Cheating as a shortcut to success, that’s one common thread in all these fields of endeavor. There’s also cheating in politics, which I’m interested in as a political scientist, but right now I’m kinda sick of thinking about lying cheating political figures—this includes elected officials but also activists and funders (i.e., the bribers as well as the bribed)—so I won’t consider them here.

Another common thread is that you’re not supposed to cheat, so the cheater has to keep it hidden, and sometimes the coverup is, as they say, worse than the crime.

A final common thread is that business, science, sports, journalism, and art are . . . not cartels, necessarily, but somewhat cooperative enterprises whose participants have a stake in the clean reputation of the entire enterprise. This motivates them to look away when they see cheating. It’s unpleasant, and it’s bad all around for the news to spread, as this could lead to increased distrust of the entire enterprise. Better to stick to positivity.

Differences

The key difference I see between these different areas is that in business it’s kinda hard to cheat by accident. In science we have Clarke’s Law: Any sufficiently crappy research is indistinguishable from fraud. In business or sports we wouldn’t say that. OK, there might be some special cases, for example someone sells tons of acres of Florida swampland and is successful because he (the salesman) sincerely thinks it’s legitimate property, but in general I think of business frauds as requiring something special, some mix of inspiration, effort, and lack of scruple that most of us can’t easily assemble. A useful idiot might well be useful as part of a business fraud, but I wouldn’t think that ignorance would be a positive benefit.

In contrast, in research, a misunderstanding of scientific method can really help you out, if your goal is to produce publishable, Gladwell-able, Freakonomics-able, NPR-able, Ted-able work. The less you know and the less you think, the further you can go. Indeed, if you approach complete ignorance of a topic, you can declare that you’ve discovered an entire new continent, and a pliable news media will go with you on that. And if you’re clueless enough, it’s not cheating, it’s just ignorance!

In this dimension, sports and art seem more like business, and journalism seems more like science. Yes, you can cheat in sports without realizing it, but knowing more should allow you to be more effective at it. I can’t think of a sporting equivalent to those many scientists who produce successful lines of research by wandering down forking paths, declaring statistical significance, and not realizing what they’ve been doing.

With journalism, though, there’s a strong career path of interviewing powerful people and believing everything they say, never confronting them. To put it another way, there’s only one Isaac Chotiner, but there are lots and lots of journalists who deal in access, and I imagine that many of them are sincere, i.e. they’re misleading their readers by accident, not on purpose.

Other thoughts inspired by the book Lying for Money

I took notes while reading Davies’s book. Page references are to the Profile Books paperback edition.

p.14, “All this was known at the time.” This comes up again on p.71: “At this point, the story should have been close to its conclusion. Indeed, the main question people asked in 1982, when OPM finally gave up and went bankrupt, is why didn’t it happen three years earlier? Like a Looney Toons character, nothing seemed to stop it. New investors were brought in as the old ones gave up in disgust.” This happens all the time; indeed, one of the things that struck me about the Theranos story was how the company thrived for nearly a decade after various people in the company realized the emptiness of its efforts.

A fraud doesn’t stay afloat all by itself; it takes a lot of effort to keep it going. This effort can include further lies, the judicious application of money, and, as with Theranos, threats and retaliation. It’s a full-time job! Really there’s no time to make up the losses or get the fictional product to work, given all the energy being spent to keep the enterprise alive for years after the fact of the fraud is out in the open.

p.17, “Fraudsters don’t play on moral weaknesses, greed or fear; they play on weaknesses in the system of checks and balances.” I guess it’s a bit of both, no? One thing I do appreciate, though, is the effort Davies puts in to not present these people as charming rogues.

I want to again point to a key difference between fraud in business and fraud in science. Business fraud requires some actual talent, or at least an unusual lack of scruple or willingness to take risks, characteristics that set fraudsters apart from the herd. In contrast, scientific misconduct often just seems to require some level of stupidity, enough so that you can push buttons, get statistical results, and draw ridiculous conclusions without looking back. Sure, ambition and unscrupulousness can help, but in most cases just being stupid seems like enough, and also is helpful in the next stage of the process when it’s time to non-respond to criticism.

p.18, “Another thing which will come up again and again is that it is really quite rare to find a major commercial fraud which was the fraudster’s first attempt. An astonishingly high proportion of the villains of this book have been found out and even served prison time, then been placed in positions of trust once again.” I’m reminded of John Gribbin and John Poindexter.

Closer to home, there was this amazing—by which I mean amazingly horrible—story of a public school that was run jointly by the New York City Department of Education and Columbia University Teachers College. The principal of this school had some issues. From the news report:

In 2009 and 2010, while Ms. Worrell-Breeden was at P.S. 18, she was the subject of two investigations by the special commissioner of investigation. The first found that she had participated in exercise classes while she was collecting what is known as “per session” pay, or overtime, to supervise an after-school program. The inquiry also found that she had failed to offer the overtime opportunity to others in the school, as required, before claiming it for herself.

The second investigation found that she had inappropriately requested and obtained notarized statements from two employees at the school in which she asked them to lie and say that she had offered them the overtime opportunity.

After those findings, we learn, “She moved to P.S. 30, another school in the Bronx, where she was principal briefly before being chosen by Teachers College to run its new school.”

So, let’s get this straight: She was found to be a liar, a cheat, and a thief, and then, with that all known, she was hired for two jobs as school principal?? An associate vice president of Teachers College said, “We felt that on balance, her recommendations were so glowing from everyone we talked to in the D.O.E. that it was something that we just were able to live with.” In short: once you’re plugged in, you stay plugged in.

p.47: Davies talks about how online drug dealers eventually want to leave the stressful business of drug dealing, and at this point they can cash in their reputation by taking a lot of orders and then disappearing with customers’ money. An end-of-career academic researcher can do something similar if they want, using an existing reputation to promote bad ideas. Usually though you wouldn’t want to do that, as there’s no anonymity so the negative outcome can reflect badly on everything that came before. The only example I can think of offhand is the Cornell psychology researcher Daryl Bem, who is now indelibly associated with some very bad papers he wrote on extra-sensory perception. I was also gonna include Orson Welles here, as back in the 1970s he did his very best to cash in his reputation on embarrassing TV ads. But, decades later, the ads are just an amusing curiosity and Orson’s classic movies are still around: his reputation survived just fine.

p.50: “When the same features of a system keep appearing without anyone designing them, you can usually be pretty sure that the cause is economic.” Well put!

p.57: Regarding Davies’s general point about fraud preying upon a general environment of trust, I want to say something about the weaponization of trust. An example is when a researcher is criticized for making scientific errors and then turns around, in a huff, and indignantly says he’s being accused of fraud. The gambit is to move the discussion from the technical to the personal, to move from the question of whether there really is salad oil in those tanks to the question of whether the salad oil businessman can be trusted.

p.62: Davies writes, “fraud is an unusual condition; it’s a ‘tail risk.'” All I can say is, fraud might be an unusual “tail risk” in business, but in science it’s usual. It happens all the time. Just in my own career, I had a colleague who plagiarized; another one who published a report deliberately leaving out data that contradicted the story he wanted to tell; another who lied, cheated, and stole (I can’t be sure about that one as I didn’t see it personally; the story was told to me by someone who I trust); another who smugly tried to break an agreement; and another who was conned by a coauthor who made up data. That’s a lot! It’s two cases that directly affected me and three that involved people I knew personally. There was also Columbia faking its U.S. News ranking data; I don’t know any of the people involved but, as a Columbia employee, I guess that I indirectly benefited from the fraud while it was happening.

I’d guess that dishonesty is widespread in business as well. So I think that when Davies wrote “fraud is an unusual condition,” he really meant that “large-scale fraud is an unusual condition”; indeed, that would fit the rest of his discussion on p.62, where he talks about “big systematic fraud” and “catastrophic fraud loss.”

This also reminds me of the problems with popular internet heuristics such as “Hanlon’s razor,” “steelmanning,” and “Godwin’s law,” all of which kind of fall apart in the presence of actual malice, actual bad ideas, and actual Nazis. The challenge is to hold the following two ideas in your head at once:

1. In science, bad work does not require cheating; in science, honesty and transparency are not enough; just cos I say you did bad work it doesn’t mean I’m accusing you of fraud; just cos you followed the rules as you were taught and didn’t cheat it doesn’t mean you made the discovery you thought you did.

2. There are a lot of bad guys and cheaters out there. It’s typically a bad idea to assume that someone is cheating, but it’s also often a mistake to assume that they’re not.

p.65: Davies refers to a “black hole of information.” I like that metaphor! It’s another way of saying “information laundering”: the information goes into the black hole, and when it comes out its source has been erased. Traditionally, scientific journals have functioned as such a black hole, although nowadays we are more aware that, even if a claim has been officially “published,” it should still be possible to understand it in the context of the data and reasoning that have been used to justify it.

As Davies puts it on p.71, “People don’t check up on things which they believe to have been ‘signed off.’ The threat is inside the perimeter.” I’ve used that analogy too! From 2016: “the current system of science publication and publicity is like someone who has a high fence around his property but then keeps the doors of his house unlocked. Any burglar who manages to get inside the estate then has free run of the house.”

p.76: “The government . . . has some unusual characteristics as a victim (it is large, and has problems turning customers away).” This reminds me of scientific frauds, where the scientific community (and, to the extent that the junk science has potential real-world impact, the public at large) is the victim. Scientific journals have the norm of taking every submission seriously; also, a paper that is rejected from one journal can be submitted elsewhere.

p.77: “If there is enough confusion around, simply denying everything and throwing counter-accusations at your creditors can be a surprisingly effective tactic.” This reminds me of the ladder of responses to criticism.

p.78: Davies describes the expression “cool out the mark” as having been “brought to prominence by Erving Goffman.” That’s not right! Cooling out the mark was already discussed in great detail in linguist David Maurer’s classic book from 1940, The Big Con. More generally, I find Goffman irritating for reasons discussed here, so I really don’t like to see him credited for something that Maurer already wrote about.

p.114: “Certain kinds of documents are only valid with an accountant’s seal of approval, and once they have gained this seal of validity, they are taken as ‘audited accounts’ which are much less likely to be subjected to additional verification or checking.” Davies continues: “these professions are considered to be circles of trust. The idea is partly that the long training and apprenticeship processes of the profession ought to develop values of trust and honesty, and weed out candidates who do not possess them. And it is partly that professional status is a valuable asset for the person who possesses it.”

This reminds me of . . . academic communities. Not all, but much of the time. This perspective helps answer a question that’s bugged me for a while: When researchers do bad work, why do others in their profession defend them? Just to step away from our usual subjects of economics and psychology for a moment, why were the American Statistical Association and the American Political Science Association not bothered by having given major awards to plagiarists (see here and here)? You’d think they’d be angry about getting rooked, or at least concerned that their associations are associated with frauds. But noooo, the powers that be in these organizations don’t give a damn. The Tour de France removed Lance Armstrong’s awards, but ASA and APSA can’t be bothered. Why? One answer is that they—we!—benefit from the respect given to people in our profession. To retract awards is to admit that this respect is not always earned. Better to just let everyone quietly go about their business.

On p.124, Davies shares an amusing story of the unraveling of a scam involving counterfeit Portuguese banknotes: “While confirming them to be genuine, the inspector happened to find two notes with the same serial numbers—a genuine one had been stacked next to its twin. Once he knew what to look for, it was not too difficult to find more pairs. . . .” The birthday problem in the wild!
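For readers who haven’t seen it, the birthday-problem arithmetic really is surprising. Here’s a quick sketch with made-up numbers (nothing here is from Davies’s account of the Portuguese banknote case): repeats show up long before the number of items approaches the number of possible serial numbers.

```python
import math

def p_at_least_one_repeat(n_items: int, n_possible: int) -> float:
    """Probability of at least one repeated value when drawing n_items
    uniformly at random from n_possible equally likely values."""
    log_p_all_distinct = sum(math.log1p(-i / n_possible) for i in range(n_items))
    return 1.0 - math.exp(log_p_all_distinct)

print(p_at_least_one_repeat(23, 365))            # the classic version: about 0.51
print(p_at_least_one_repeat(1_000, 1_000_000))   # about 0.39 with a million possible "serial numbers"
```

The banknote case isn’t exactly random sampling, since the counterfeiter was copying real serial numbers, but the same intuition explains why duplicates in a large enough stack are hard to keep hidden once someone knows to look.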

p.126: “mining is a sector of the economy in which standards of honesty are variable but requirements for capital are large, and you can keep raising money for a long time before you have to show results.” Kind of like some academic research and tech industries! Just give us a few more zillion dollars and eventually we’ll turn a profit . . .

p.130: “The key to any certification fraud is to exploit the weakest link in the chain.” Good point!

p.131: “It’s often a very good idea to make sure that one is absolutely clear about what a certification process is actually capable of certifying . . . Gaps like this—between the facts that a certification authority can actually make sure of, and those which it is generally assumed it can—are the making of counterfeit fraud.”

This reminds me of scientific error—not usually fraud, I think, but rather the run-of-the-mill sorts of mistakes that researchers, journals, and publicists make every day because they don’t think about the gap between what has been measured and what is being claimed. Two particularly ridiculous examples from psychology are the 3-day study that was called “long term” and the paper whose abstract concluded, “That a person can, by assuming two simple 1-min poses, embody power and instantly become more powerful has real-world, actionable implications,” even though the reported studies had no measures whatsoever of anyone “becoming more powerful,” let alone any actionable implications of such an unmeasured quantity. Again, I see no reason to think these researchers were cheating; they were just following standard practice of making strong claims that sound good but were not addressed by their data. Given that experimental scientists—people whose job is to connect measurement to a larger reality!—regularly make this sort of mistake, I guess it’s not a surprise that the same problem arises in business.

p.134: Davies writes that medical professionals “have a long training program, a strong ethical code and a lot to lose if caught in a dishonest act.” But . . . Surgisphere! Dr. Anil Potti! OK, there are bad apples in every barrel. Also, I’m sure there’s some way these dudes rationalize their deeds. Ultimately, they’re just trying to help patients, right? They’re just being slowed down by all those pesky regulations.

p.136: Davies writes, “The thing is, the certification system for pharmaceuticals is also a safety system.” I love that “The thing is.” It signals to me that Davies didn’t knock himself out writing this book. He wrote the book, it was good, he was done, it got published. When I write an article or book, I get obsessive on the details. Not that I don’t make typos, solecisms, etc., but I’m pretty careful to keep things trim. Overall I think this works, it makes my writing easier to read, but I do think Davies’s book benefits from this relaxed style, not overly worked over. No big deal, just something I noticed in different places in the book.

p.137: “Ranbaxy Laboratories . . . pleaded guilty in 2013 to seven criminal charges relating to the generic drugs it manufactured . . . it was in the habit of using substandard ingredients and manufacturing processes, and then faking test results by buying boxes of its competitors’ branded product to submit to the lab. Ranbaxy’s frauds were an extreme case (although apparently not so extreme as to throw it out of the circle of trust entirely; under new management it still exists and produces drugs today).” Whaaa???

p.145: Davies refers to “the vital element of time” in perpetuating a fraud. A key point here is that uncovering the fraud is never as high a priority to outsiders as perpetuating the fraud is for the fraudsters. Even when money is at stake, the amount of money lost by each individual investor will be less than what is at stake for the perpetrator of the fraud. What this means is that sometimes the fraudster can stay alive by just dragging things out until the people on the other side get tired. That’s a standard strategy of insurance companies, right? To delay, delay, delay until the policyholder just gives up, making the rational calculation that it’s better to just cut your losses.

I’ve seen this sort of thing before, that cheaters take advantage of other people’s rationality. They play a game of chicken, acting a bit (or a lot) crazier than anyone else. It’s the madman theory of diplomacy. We’ve seen some examples recently of researchers who’ve had to deal with the aftermath of cheating collaborators, and it can be tough! When you realize a collaborator is a cheater, you’re dancing with a tiger. Someone who’s willing to lie and cheat and make up data could be willing to do all sorts of things, for example they could be willing to lie about your collaboration. So all of a sudden you have to be very careful.

p.157: “In order to find a really bad guy at a Big Four accountancy firm, you have to be quite unlucky (or quite lucky if that’s what you were looking for). But as a crooked manager of a company, churning around your auditors until you find a bad ‘un is exactly what you do, and when you do find one, you hang on to them. This means that the bad auditors are gravitationally drawn into auditing the bad companies, while the majority of the profession has an unrepresentative view of how likely that could be.”

It’s like p-hacking! Again, a key difference is that you can do bad science on purpose, you can do bad science by accident, and there are a lot of steps in between. What does it mean if you use a bad statistical method, people keep pointing out the problem, and you keep doing it? At some point you’re sliding down the Clarke’s Law slope from incompetence to fraud. In any case, my point is that bad statistical methods and bad science go together. Sloppy regression discontinuity analysis doesn’t have to be a signal that the underlying study is misconceived, but it often is, in part because (a) regression discontinuity is a way to get statistical significance and apparent causal identification out of nothing, and (b) if you are doing a careful, well-formulated study, you might well be able to model your process more directly. Theory-free methods and theory-free science often go together, and not in a good way.

p.161: “The problem is that spotting frauds is difficult, and for the majority of investors not worth spending the effort on.” Spotting frauds is a hobby, not a career or even a job. And that’s not even getting into the Javert paradox.

p.173: “The key psychological element is the inability to accept that one has made a mistake.” We’ve seen that before!

p.200: “The easier something is to manage—the more possible it is for someone to take a comprehensive view of all that’s going on, and to check every transaction individually—the more difficult it is to defraud.” This reminds me of preregistration in science. It’s harder to cheat in an environment where you’re expected to lay out all the steps of your procedure. Cheating in that context is not impossible, but it’s harder.

p.204: Davies discusses “the circumstances under which firms would form, and how the economy would tend not to the frictionless ideal, but to be made up by islands of central planning linked by bridges of price signals.” Well put. I’ve long thought this but, without having a clear formulation in words, it wasn’t so clear to me. This is the bit that made me say the thing at the top of this post, about this being the best economics book I’ve ever read.

p.229: “as laissez-faire economics was just getting off the ground, the Victorian era saw the ideology of financial deregulation grow up at the same time as, and in many cases faster and more vigorously than, financial regulation itself.” That’s funny.

p.231: “The normal state of the political economy of fraud is one of constant pressure toward laxity and deregulation, and this tends only to be reversed when things have got so bad that the whole system is under imminent threat of losing its legitimacy.” Sounds like social psychology! Regarding the application to economics and finance, I think Davies should mention Galbraith’s classic book on the Great Crash, where this laxity and deregulation thing was discussed in detail.

p.243: Davies says that stock purchases by small investors are very valuable to the market because, as a stockbroker, you can “be reasonably sure that you’re not taking too big a risk that the person selling stock to you knows something about it that you don’t.” Interesting point, I’m sure not new to any trader but interesting to me.

p.251: “After paying fines and closing down the Pittston hole, Russ Mahler started a new oil company called Quanta Resources, and somehow convinced the New York authorities that despite having the same owner, employees, and assets, it was nothing to do with the serial polluter that they had banned in 1976.” This story got me wondering: were the authorities asleep at the switch, or were they bribed, or did they just have a policy of letting fraudsters try again?

As Davies writes on p.284, “comparatively few of the case studies we’ve looked at were first offenses. . . . there’s something about the modern economic system that keeps giving fraudsters second chances and putting people back in positions of responsibility when they’ve proved themselves dishonest.” I guess he should say “political and economic system.”

Davies continues: “This is ‘white-collar crime’ we’re talking about after all; one of its defining characteristics is that it’s carried out by people of the same social class as those responsible for making decisions about crime and punishment. We’re too easy on people who look and act like ourselves.” I guess so, but also it can go the other way, right? I think I’m the same social class as Cass Sunstein, but I don’t feel any desire to go easy on him; indeed, it seems to me that, with all the advantages he’s had, he has even less excuse to misrepresent research than someone who came in off the street. From the other direction, he might see me as a sort of class traitor.

p.254: “It’s a crime against the control system of the overall economy, the network of trust and agreement that makes an industrial economy livable.” That’s how I feel about Wolfram Research when they hire people to spam my inbox with flattering lies. If even the classy outfits are trying to con me, what does that say about our world?

p.254: “Unless they are controlled, fraudulent business units tend to outcompete honest ones and drive them out of business.” Gresham!

p.269: “Denial, when you are not part of it, is actually a terrifying thing. One watches one’s fellow humans doing things that will damage themselves, while being wholly unable to help.” I agree. This is how I felt when corresponding with the ovulation-and-clothing researchers and with the elections-and-lifespan researchers. The people on the other side of these discussions seemed perfectly sincere; they just couldn’t consider the possibility they might be on the wrong track. (You could say the same about me, except: (1) I did consider the possibility I could be wrong in these cases, and (2) there were statistical arguments on my side; these weren’t just matters of opinion.) Anyway, setting aside if I was right or wrong in these disputes, the denial (as I perceived it) just made me want to cry. I don’t think graduate students are well trained in handling mistakes, and then when they grow up and publish research, they remain stuck in this attitude. I can see how this could be even more upsetting if real money and livelihoods are on the line.

Finally

In the last sentence of the last page of his book, Davies writes, “we are all in debt to those who trust; they are the basis of anything approaching a prosperous and civilised society.”

To which I reply, who are the trusters to whom we are in debt? For example, I don’t think we are all in debt to those who trust scams such as Theranos or the Hyperloop, nor are we in debt to the Harvard professor who fell for the forged Jesus document and then tried to explain away its problems rather than just listening to the critics. Nor are we in debt to the administrations of Cornell University, Ohio State University, the University of California, etc., when they did their part to defuse criticism of bad work being done by their faculty who had been so successful at raising money and getting publicity for their institutions.

I get Davies’s point in the context of his book: if you fall for a Wolfram Research scam (for example), you’re not the bad guy. The bad guy is Wolfram Research, which is taking advantage of your state of relaxation, tapping into the difficult-to-replenish reservoir of trust. In other settings, though, the sucker seems more complicit, not the bad guy, exactly—ultimately the responsibility falls on the fraudsters, not the promoters of the fraud—but their state of trust isn’t doing the rest of us any favors, either. So I’m not really sure what to think about this last bit.

P.S. Sean Manning reviews the book here. Perhaps surprisingly, there’s essentially no overlap between Manning’s comments and mine.

Welcome to the grand opening of the Alexey Guzey Sleep Center at the University of California!

Stephen Olivier writes:

I sometimes listen to the Tim Ferriss podcast and get email updates on his guests. Today I see he welcomed back Dr Matthew Walker of sleep infamy, and I couldn’t help noticing the final paragraph:

This is perhaps a perfect opportunity for a crowdfunding campaign to ensure that it gets the nice and accurate name it deserves.

“Note that this opportunity is in the 7-figure range.” Wow! Say what you want about this Why We Sleep guy, he’s got chutzpah!

I think it should be named the Alexey Guzey Center for Sleep Studies, or perhaps the University of California Center for Research Integrity.

I was all set to cut the check and fund this center, but then I realized . . . 7 figures, that’s a million bucks! Not that this would be a bad way to spend a million bucks, but for that same amount of money, I could pay to attend 59 exclusive off-the-record Dialog conferences, or buy 695,652 Jamaican beef patties. I choose the beef.

But if any of you have 7 figures available for this donation, you should definitely go for it. Imagine, an entire center named after you!

The causal revolution in econometrics has gone too far.

Kevin Lewis points us to this recent paper, “Can invasive species lead to sedentary behavior? The time use and obesity impacts of a forest-attacking pest,” published in Elsevier’s Journal of Environmental Economics and Management, which has the following abstract:

Invasive species can significantly disrupt environmental quality and flows of ecosystem services and we are still learning about their multidimensional impacts to economic outcomes of interest. In this work, I use quasi-random US county detections of the invasive emerald ash borer (EAB), a forest-attacking pest, to investigate how invasive-induced deforestation can impact obesity rates and time spent on physical activity. Results suggest that EAB is associated with 1–4 percentage points (pp) (mean = 37.0%) annual losses of deciduous forest cover in infested counties. After EAB detection, obesity rates are higher by 2.5pp (mean = 24.7%) and daily minutes spent on physical activity are lower by 4.9 min (mean = 51.7 min), on average. I show that less time spent on outdoor sports and exercise is one possible, but not exclusive, mechanism. Nationwide, EAB is associated with $3.0 billion in annual obesity-related healthcare costs over 2002–2012, equivalent to approximately 1.2% of total annual US medical costs related to obesity. Results are supported by many robustness and falsification tests and an alternative IV specification. This work has policy implications for invasive species management and expands our understanding of invasive species impacts on additional economic outcomes of interest.

Seeing this sort of thing makes me feel that the causal revolution in econometrics has gone too far. The first part of the analysis involves invasive species and loss of forest cover. That part is ok, I guess. I don’t know anything about invasive species, but it sure sounds like loss of forest cover is the kind of thing they could cause. The problem I have is with the second part of the analysis, on obesity and time spent on outdoor sports and exercise. It just seems too much of a stretch, especially given that the whole analysis is on a county level.

To put it another way: there are lots and lots of things that could affect obesity and time spent on exercise, and invasive species reducing forest cover seems like the least of it.

From the other direction: the places where invasive species are spreading are not a random selection of U.S. counties. Places with more or less invasive species will differ in all sorts of ways, some of which might happen to be correlated with time spent on exercise, obesity, and all sorts of other things.

In short, I see no reason to believe the causal claims made in the article. On the other hand, it says:

A multitude of fixed effects and controls for socioeconomic and demographic confounders are used in order to isolate the EAB effect. I also estimate a suggestive first-stage model showing EAB’s impact to county-level deciduous forest cover, in order to preliminarily investigate the suspected mechanism by which EAB spread may translate into biological effects on obesity and physical activity.

The causal interpretation of my findings is supported by several checks, including: (i) an event study plot showing increasing marginal impacts of EAB over time, consistent with the biologically delayed timing of EAB-induced deforestation; (ii) falsification tests showing no impact of EAB on being underweight, no impact of EAB in the years prior to actual detection, and no impact of EAB on non-ash coniferous forest canopy; (iii) a robustness check that accounts for spatial autocorrelation in EAB detection using a Spatial Durbin Model; (iv) an investigation of biological mechanisms using daily time use diary data from the American Time Use Survey (ATUS); (v) results showing that changes in economic activity are likely not driving my findings, and; (vi) an IV specification that uses EAB detections as an instrument for deciduous forest cover to validate a suspected deforestation pathway of effect.

Sorry, but all the multitudes and Durbins and specifications and pathways don’t do it for me. Again, the pattern of invasive species is non-random, and it can vary with just about anything. So, no, I don’t agree with the claim that “This work contributes to the literature on the economics of invasive species by broadening our understanding of invasives’ true indirect costs to society.”

What’s going on here?

Remember that quote from Tukey, “The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data”?

Another way to put it is that the story they’re trying to tell in this paper, starting with invasive species and forest cover and ending up with obesity and physical activity, is just too attenuated to be estimated from available data.

As I see it, there’s a misplaced empiricism going on here, an idea that by using proper econometric or statistical techniques you can obtain a “reduced-form” estimate. The trouble, as usual, is that:
1. Realistic effect sizes will be impossible to detect in the context of natural variation,
2. Forking paths allow researchers to satisfy that “aching desire” for a conclusive finding,
3. P-values, robustness tests, etc., help researchers convince themselves that the patterns they see in these data provide strong evidence for the stories they want to tell (see the rough calculation after this list), and
4. Given an existing academic tradition, researchers don’t notice 1, 2, and 3 above. They’re like the proverbial fish not seeing the water they’re swimming in.
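To put rough numbers on points 2 and 3, here is a crude multiple-comparisons stand-in for forking paths. It assumes independent 5% tests, which real, correlated specification choices are not, so read it as an intuition pump rather than a model of this paper: an analyst who can choose among a few dozen defensible specifications (subgroups, covariate sets, outcome codings, robustness checks) has a good chance of finding at least one “statistically significant” result even when nothing is going on.

```python
# Crude stand-in for forking paths: k independent tests at the 5% level
# under a pure null. Real specification choices are correlated, so this is
# only an intuition pump, not a model of any particular analysis.
for k in (1, 5, 10, 20, 40):
    p_at_least_one = 1 - 0.95 ** k
    print(f"{k:>2} specification choices -> P(at least one 'significant' result) = {p_at_least_one:.2f}")
```

That is the mechanism by which an aching desire plus a flexible analysis pipeline turns noise into conclusive-sounding findings; the robustness and falsification tests reassure the analyst but do not close off the garden of forking paths.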

Criticism as a collaboration between authors and audience

At this point it’s time for someone to pipe up that we shouldn’t be criticizing a paper we haven’t read, that we’re being mean to the author whom we should’ve contacted first, who’s either a working stiff who does not deserve to be criticized by a bigshot, or else a bigshot himself who should be able to ignore the pinpricks of the haters, etc etc etc.

To these (hypothetical) criticisms, I reply that, no, I don’t think we should be required to spend $24.95 in order to criticize published work.

More generally, publishing a work makes it public. If you don’t want work to be doubted in public, there’s no need to publish it. Just to be clear, I’m not saying the author of the above-discussed paper is bothered by this criticism. I’m speaking more generically here.

Also, I’m fine with people publishing in paywalled journals. I do it too! Publication is a pain in the ass, and we’ll usually go with whatever journal will take our paper. It’s a weird thing because we’re providing the content and doing all the work, and they’re then taking possession of it, but that’s how things go, and we’re typically too busy with the next project to want to buck the system on this one.

So, to continue, I hope we can see this criticism as a collaborative effort between authors and audience. The authors do the service of publishing their work rather than merely spreading it on the whisper network, and the critics do the service of posting their criticisms publicly rather than keeping it on the Q.T. and contacting the authors in secret.

Doing this in public allows everyone to be involved—including any third parties who’d like to argue that my criticisms are misplaced and we should believe the claims in the above-discussed article. Those of you who disagree with me—you should be able to see what I have to say too, not just have this locked in an email to the authors which you’ll never see.

As to my comments being critical: Yeah, I don’t think the published analysis is saying what is claimed. That’s too bad. It’s nothing personal. There are some dead-end paradigms in scientific research. It happens. We have to be looking at the big picture. We’re not doing researchers any favors by politely accepting claims that aren’t supported by the data. Indeed, take enough such claims and you can put them together and you end up with an entire junk literature which can be meta-analyzed into junk claims.

What, then, to do?

The final question is: what would I recommend that authors of this sort of paper do? If I don’t believe their claims—if, indeed, I think the connection between invasive species and obesity is too tenuous for such an analysis to “work” in the sense of telling us something about the effects of invasive species on obesity, as opposed to turning up some correlations in observational data—then, given that they’re interested in this topic and they have access to these data, what should they do?

I’m not sure—maybe there’s nothing useful they can do at all here!—but, if there is something to be gained here, my suggestion is to frame the problem observationally. These are the places with more invasive species, what’s been happening in these places, how do these places differ from otherwise-similar areas that did not have an invasive-species problem, etc. I’d say just drop the county-level obesity data entirely, but if you want to study it, look at the usual factors such as urban-rural, age, ethnic composition, etc. Learn what you can learn, forget about the big claims.

It’s worse than you might think: Passive corruption in the social sciences

When I say “passive corruption,” I’m talking not about the people who directly cheat, but about those who know about cheating but don’t do anything about it, I suspect out of some mixture of the following motivations:
1. Not wanting to waste any more time or attention on bad work
2. Fear of the social or professional consequences of confronting cheaters
3. Concern that the general air of skepticism will spread to their own work.

I was thinking about this in light of Stephanie Lee and Nell Gluckman’s new article, “A Dishonesty Expert Stands Accused of Fraud. Scholars Who Worked With Her Are Scrambling,” following up on the story we discussed a couple days ago of the Ted-talking data fakers who write books about lying and rule-breaking. Lee and Gluckman write:

To Maurice E. Schweitzer, a University of Pennsylvania business professor, it seemed logical to team up with Francesca Gino, a rising star at Harvard Business School. They were both fascinated by the unseemly side of human behavior — misleading, cheating, lying in order to profit — and together, they published eight studies over nearly a decade. Now, Schweitzer wonders if he was the one being deceived. Gino is on administrative leave from Harvard amid allegations that research she co-authored contains fabricated data . . .

The revelations have shaken and saddened the behavioral-science community. . . . some are looking with suspicion at the dishonesty researcher they once knew and trusted, a deeply disorienting sensation. A prolific body of studies, a record of headline-grabbing results, a dedication to running experiments on her own: These once looked like the hallmarks of a model scholar. . . .

“There’s so many of us who were impacted by her scholarship, by her leadership in the field,” Schweitzer told The Chronicle, “and as a co-author, as a colleague, it’s deeply upsetting.” . . . By one count, she has 148 collaborators. According to her résumé as of August 2022, she has published more than 135 articles since 2007, many of them in the field’s top journals . . .

Kouchaki and Galinsky are among Gino’s most frequent collaborators, having worked on, respectively, at least 14 and seven papers with her. Neither responded to requests for comment. . . .

From late 2016 to 2019, Gino was the editor of Organizational Behavior and Human Decision Processes, where she characterized herself as a proponent of “open science” efforts to solve her field’s replication crisis. . . .

Gino’s dozens of papers about ethical leadership and workplace behavior, done with other scholars at elite business schools, led to countless speaking and consulting gigs . . .

In 2019, an outside team published a meta-analysis of studies about dishonesty, including several of Gino’s, and tried to obtain the original data for each of them. For 10 papers that listed Gino as the first author, the team doing the analysis reported being told that the underlying data was unavailable. . . .

[Gino’s collaborator] says he did with Gino what most academics do: trust each other. “I don’t tell my Ph.D. students, ‘Never plagiarize work, never make up data,’” he said. “I assume that’s obvious.” But in hindsight, he acknowledged that it would have been better to supervise the data collection more closely. “Clearly we need to be more vigilant and less trusting than we’ve been,” he said. . . .

Still others say they are reserving judgment until they finish digging into their past work. “I am waiting to learn more about this case,” Juliana Schroeder, an associate professor at UC Berkeley’s Haas School of Business, and a seven-time collaborator with Gino, tweeted over the weekend. “It is extremely concerning.” . . .

I have a problem with this narrative.

OK, just to be clear, I don’t have any problem with Lee and Gluckman’s reporting, nor do I have a problem with the unfortunate collaborators of the researcher who was making up data. I, too, have been in collaborations where I’ve never looked at the raw data and I was never involved in the data collection. It really is all about trust, and anyone can get conned by someone who is willing to lie about things. As we discussed in our earlier post, there’s this weird thing where many prominent cheaters don’t just cheat, they also seem to love to talk about it, writing books with titles such as “Evilicious: Why We Evolved a Taste for Being Bad,” “How We Lie to Everyone—Especially Ourselves,” “Is Everybody Cheating These Days?,” and “Why it pays to break the rules at work and in life.” These people either like to live on the edge and taunt the rest of us, or they sincerely believe that “everybody is cheating.” If you’re a cheater and you regularly lie to your friends and collaborators and you write books about how it pays to break the rules, then maybe you think that normies are just saps, the academic equivalent of tourists walking around in Times Square in Bermuda shorts with wallets hanging out their back pockets.

Ok, then, so here are my problems with the narrative of the behavioral scientists who now are discovering they were too trusting:

First, these people were doing all this research on cheating. They were collaborating with a researcher who was writing books and giving speeches on how everyone’s a cheater. So why would they think it’s ok to operate on blind trust? It’s almost as if they didn’t believe their own research! In tech terms, they weren’t eating their own dogfood.

Second, this has happened before and I remember how it happened. Various past cheaters got tons of institutional support. Marc Hauser finally got kicked out of Harvard, but that didn’t stop superstar academic linguist Noam Chomsky from continuing to defend him, nor did it stop superstar academic psychologists Susan Carey and Steven Pinker from supporting his later iffy venture; see P.P.S. here. Brian Wansink got kicked out of Cornell, but it took a while, and, before that happened, various people were saying we should go easy on him because he was such a nice guy. The cheater was defended by the tone police. When the problems with Matthew Walker’s sleep research came up, the University of California didn’t care. Maybe Walker was a nice guy too. With Dan Ariely it was the opposite story. After his disappearing-data problem arose, I asked around informally within the decision science community, and it seemed that there was a consensus that this guy could not be trusted. But I guess this scuttlebutt didn’t reach the administration of Duke University, nor did it seem to concern the editors of the Wall Street Journal.

Anyway, here’s my point. These people were writing papers and books about cheating, they had cheaters in their midst, they still have cheaters in their midst. And that’s not even starting on all the bad research where there’s no data fabrication or outright lying, just useless unreplicable research produced via forking paths. It’s common knowledge in the behavioral science community that tons of crap is out there and will never be retracted, and that lots of people don’t speak up when they are aware of dodgy behavior; see items 1, 2, and 3 above.

So here’s the deal. I’m not saying that most or even many behavioral researchers are liars, cheaters, or frauds. What I’m saying is that this is an academic community that’s consistently looked away or downplayed lying, cheating, and fraud in its midst.

As I put it a year ago in my post, Should we spend so much time talking about cheaters and fraudsters?:

Science is kind of like . . . someone poops on the carpet when nobody’s looking, some other people smell the poop and point out the problem, but the owners of the carpet insist that nothing has happened at all and refuse to allow anyone to come and clean up the mess. Sometimes they start shouting at the people who smelled the poop and call them terrorists. Meanwhile, other scientists carefully walk around that portion of the carpet: they smell something but they don’t want to look at it too closely.

A lot of business and politics is like this too. The difference is, we expect this sort of thing to happen in business and politics: rulebreakers keep doing it over and over again. Science is supposed to be different.

P.S. I’m not saying that statisticians and political scientists have any moral superiority to psychologists and experimental economists. It just happens to be easier to make up data in experimental behavioral science. Statistics is more about methods and theory, both of which are inherently replicable—if nobody else can do it, it’s not a method!—and political science mostly uses data that are more public so harder to fake (exceptions such as Mary Rosh and Michael Lacour aside).

P.P.S. Here’s another example. The first edition of the Nudge book promoted the research of Brian Wansink, which they called “masterpieces.” Then after that work was discredited, the Nudge authors removed it from their second edition. Removing work that’s known to be fatally flawed—that’s a good thing to do. But the bad thing is that, in this second edition, there’s no mention of their mistake from before! They memory-holed their earlier cheerleading for work that turned out to be fraudulent. They’re not rewriting history exactly, but they’re framing things as if the error had never existed, thus losing an opportunity to confront their error. To connect this to the main point of this post, yes, they got fooled, but also they’re setting themselves up for future problems by looking away from the problem.

Explaining the horribly wasteful U.S. health care system as a combination of rich-countries-spend-more-on-health-care and diminishing-returns-to-health-care-spending

tl;dr. A document saying it opposes the conventional wisdom on U.S. health care spending actually, in my view, supports the conventional wisdom on U.S. health care spending.

The conventional wisdom (which I agree with)

The conventional wisdom is that Americans spend too much on health care and get too little. This conventional wisdom resonates with just about anyone who’s ever had to deal with the U.S. healthcare system, and there are also graphs like this one, based on data from 2007:

The story

An anonymous correspondent points us to this analysis of U.S. healthcare costs that argues against the conventional wisdom about the reasons for those costs, instead making the point that richer countries spend more on health care than poorer countries, and the U.S. is one of the richest countries in the world, so we spend a lot, and thus the U.S. doesn’t stand out at all from the crowd:

I don’t quite buy the above graph, as the line going through USA seems reliant on the iffy quadratic term in the regression—but even if you draw a straight line and drop the mysterious “ARE” point, the USA would not be much higher than the fitted line, only 10%-15% higher, it appears.
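To make that check concrete, here is a minimal sketch of the sensitivity analysis I have in mind: fit the cross-country regression with and without the quadratic term, with and without the “ARE” point, and see how far the U.S. sits above the fitted line. The numbers in the data frame below are made-up placeholders purely for illustration; the real check would use the spending and income figures behind the graph in the linked post.

```python
# Minimal sketch, assuming a table of countries with per-capita health spending
# and a per-capita income/consumption measure. All numbers below are placeholders.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "country":  ["USA", "ARE", "NOR", "LUX", "DEU", "FRA"],
    "income":   [55.0, 40.0, 52.0, 60.0, 42.0, 38.0],   # hypothetical, e.g. $000s
    "spending": [10.5,  2.5,  6.5,  6.0,  5.5,  4.9],   # hypothetical, e.g. $000s
})

def usa_ratio_to_fit(df, degree, drop=()):
    """Fit log(spending) on log(income) with the given polynomial degree,
    optionally dropping some countries, and return observed/fitted for the USA."""
    sub = df[~df["country"].isin(drop)]
    coefs = np.polyfit(np.log(sub["income"]), np.log(sub["spending"]), deg=degree)
    usa = df.loc[df["country"] == "USA"].iloc[0]
    fitted = np.exp(np.polyval(coefs, np.log(usa["income"])))
    return usa["spending"] / fitted

print(usa_ratio_to_fit(df, degree=2))                  # quadratic, all points
print(usa_ratio_to_fit(df, degree=1, drop=["ARE"]))    # straight line, no ARE
```

If the second ratio comes out only 10-15% above 1, that corresponds to the “not much higher than the fitted line” reading described above.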

What does the U.S. get out of all that spending? The author of that post seems to agree that we don’t get much, writing:

America’s mediocre health outcomes can be explained by rapidly diminishing returns to spending and behavioral (lifestyle) risk factors, especially obesity, car accidents, homicide, and (most recently) drug overdose deaths. . . . The diminishing returns are evident in cross-sectional analysis:

In the earlier graph shown above, Luxembourg and Norway are also on the high end of spending relative to life expectancy, but nothing compared to the United States.

My take

My main reaction is that the take-home point of the post is the above-cited bit about diminishing returns, which seems consistent with the conventional wisdom that the U.S. overpays for health care. Maybe Luxembourg and Norway do so too, to a lesser extent, but that doesn’t make me feel any better!

This connects to a general statistical issue that came up a few years ago, which we called the “all else equal” fallacy. The comparison of the U.S. to other countries convincingly shows that richer countries tend to spend a greater proportion of their consumption on health care, with the U.S. not standing out much except for being richer. But, as discussed above, I think this is all consistent with the conventional wisdom that we’re overpaying. Indeed I think this conventional wisdom is supported by the argument in the linked post about diminishing returns to health-care spending.

To put it another way: we in the U.S. are overpaying for health care. It might be that as other countries get as rich as we are now, they’ll overpay too, just like we do! Or maybe not. Maybe they’ll learn from our experience and decide not to spend an increasing share of their consumption on aspects of health care that aren’t improving health.

I like the general approach of the above-linked post, which is data-focused, and all about drawing the most direct inferences from available data rather than being flashily counterintuitive. It just seems to me that the title of the post, “Why conventional wisdom on health care is wrong,” does not fit its content.

The political angle

There’s also a political angle here that I don’t fully understand.

There seems to be a position on the left or center-left that U.S. health care spending is out of control, and a position on the right or center-right that our system is just fine.

I kind of get this correlation. Until the Obama-era health-care law, the alternative to the U.S. system was considered to be some sort of socialized system, a national insurance or Medicare for all, and often these proposals were motivated by comparisons to Canada, France, or other systems. So if you’re a U.S. liberal, it makes sense to view the American system as worse than that in these other countries, whereas if you’re a U.S. conservative, it makes sense to argue that, despite appearances, our system is just fine, perhaps even better than elsewhere.

On the other hand, now that we have some sort of universal health insurance, maybe there’s motivation for liberals to say good things about our system and for conservatives to complain.

But, beyond all this, there’s a lot of government involvement in the U.S. health care system. Consider all the weird results of Medicare spending rules, which somehow percolate through the entire system. It would seem very natural for a liberal or a conservative to attribute some of the problems with our current system to a tangled bureaucracy. To support our crappy system just cos it’s not officially socialist . . . that just seems like a mistake. Go back to the above-linked post, see the bit about diminishing returns and the bit about richer countries being willing to spend more, put that together and you can see what a mess we’ve gotten into here! From the other direction, there’s no reason for people who see problems with our health-care system to think they need to argue with many of the points in that post.

Evil spammers waste our time.

I got an email from one of our regular commenters, a nice guy who contributes to our blog discussions when not busy at his day job. He wrote a comment that did not show up on the blog, and emailed me:

This is not at all important, but since something similar happened recently I wonder if something is up with your comment form.

I explained that the blog has a spam filter that catches hundreds of items per day, so many that it’s impossible to go through them and check for legitimate comments. If you email and tell me you have a comment that didn’t show up, I can search through the spam folder for your name and approve the comment. In this case I found one of this guy’s comments but not the other, as I’d emptied the spam folder a few days earlier.

Also every day we get a few comments that are in some intermediate state—not approved and not sent straight into spam, but held in the main comment folder waiting for me to decide whether to approve them or send them to spam. About half of these are legit and about half are spam; for example, here’s one that recently came in:

Was looking for some takes regarding this topic and I found your article quite informative. It has given me a fresh perspective on the topic tackled. Thanks!

This one’s obviously spam. Other times the comment seems to have been written by a computer program that pulls phrases from the post or from published comments, or maybe it’s an actual human who’s trying to write a comment that will be approved. In this case the commenter had a link to a homemade site selling some crap. The only thing I don’t know is if it’s written by the site owner in a desperate attempt to get some web traffic, or if he paid some sleazy online company to promote his site, and this is what they’ve come up with.

In any case, it’s horrible, not just cos it wastes my time but also because it creates the conditions under which a fun and serious legitimate commenter can’t always get through. Not the worst thing going on, just annoying vandalism. Can’t these people find some better things to do, like publishing fake psychology studies, going on NPR, and giving Ted talks?

Incompetence or fraud hidden in plain sight

We’ve been hearing a lot about the colorful con artist George Santos, who was recently elected to the U.S. Congress. One news story asks:

Why, people keep asking, did it take so long for his lies to be revealed? Why did no one think to poke deeper? Why did the people who did know something fishy was going on not speak up?

In part because he just looked so darn convincing.

This reminds me of some examples of incompetence or fraud in business or science where the evidence was sitting in plain sight but the perpetrators were able, Wile E. Coyote-style, to remain suspended in the air for years:

Theranos: The company faked a test in 2006, causing one of its chief executives to leave—but it wasn’t until nearly ten years later that this all caught up with them.

Wansink: The main discoveries of misconduct occurred in 2016 and 2017, but the Cornell Food and Brand Lab had a track record of scientific misconduct dating back at least to 2012. Didn’t stop NPR and the Nudgelords from treating him like a hero.

Musk: We were mocking that L.A. tunnel plan back in 2018, and I’m not even particularly up on these things. But only recently has this all seemed to catch up with the auto investor.

Ariely: Apparently the problems with his work were well known, years before shreddergate came out.

But, yeah, all these people dressed the part.

And here’s another one, where the ethical problems of a biology professor were first flagged in 2010, this dude continued doing suspicious things for over a decade, and only last year did it come out. The dean of his college did a bit to quash dissent, and the story also featured legal threats from Herbalife. And, yeah, it seems that they’re still around too.

Sean offered a relevant comment on all this:

The two fundamental governing principles are 1) that it’s much easier to establish that a business, a person, or an argument waves red flags than to communicate that to and convince the public, and 2) that there is much more money and fame in bunking than debunking, and a heavy cost to debunking (because the huckster or hucksters and many of their victims are motivated to push back). The wise (or at least *homines economici*) see that, investigate claims which seem to be too good to be true, quietly warn friends about what they find, and move on.

“This is a story some economists like to tell . . .”

I was thinking more about this story, where a series of economists took a story based on someone’s childhood memories and elaborated upon it in different ways until it eventually appeared in a major journal in the field. A few years after that, enough people got irritated that the journal ran a correction:

There is an error in “Self-Control at Work” (Kaur, Kremer, and Mullainathan 2015), published in the October 2015 edition of this journal (vol. 123, no. 6). In section VI, on page 1274, the paper includes the following incorrect quote from a paper by Steven N. S. Cheung:

The second view—that joint production necessitates the need for monitoring (Alchian and Demsetz 1972)—is summarized in a story by Steven Cheung (1983, 8): “On a boat trip up China’s Yangtze River in the 19th Century, a titled English woman complained to her host of the cruelty to the oarsmen. One burly coolie stood over the rowers with a whip, making sure there were no laggards. Her host explained that the boat was jointly owned by the oarsmen, and that they hired the man responsible for flogging.”

While the incorrect quote also appears in other earlier sources, it does not appear in Cheung’s original article. [For example, the incorrect version of the quote also appears in Jensen et al. (1998).]

The accurate quotation from Cheung (1983, 8) is as follows:

My own favorite example is riverboat pulling in China before the communist regime, when a large group of workers marched along the shore towing a good-sized wooden boat. The unique interest of this example is that the collaborators actually agreed to the hiring of a monitor to whip them. The point here is that even if every puller were perfectly “honest,” it would still be too costly to measure the effort each has contributed to the movement of the boat, but to choose a different measurement agreeable to all would be so difficult that the arbitration of an agent is essential.

The inaccurate quote was included simply as a way to illustrate the idea that joint production might necessitate the need for monitoring. . . . However, the quote is in no way central to the core point of the paper, or even for the discussion in section VI of the paper. . . . Consequently, this incorrect quote can be omitted from the paper without any impact on the substance of the paper.

To start with, the story was changed in many important ways! To call this an “incorrect quote” is an extreme understatement of what happened here. Second, that last bit about removal having no impact on the substance—it makes me wonder why it was included in the article at all. Surely it must have some impact, no?

After reading through the comments to the above-linked post and thinking more about this, I think I’ve come up with an answer.

How was the story changed?

The revised story has several elaborations, most of which seem like the result of unthinking ignorance:

– Changing from an unspecified river to the Yangtze: That’s the #1 river that Americans think of, when they hear about a river in China. According to the original teller of the story, it was a “journey from Liuzhou to Guiping,” which according to the map is not near the Yangtze. Kind of like how that buffoonish business-school professor took a story about the Alps and moved it to Switzerland.

– Changing from the 20th century to the 19th: The scenario sounds old-fashioned, so the storyteller unthinkingly moves it back in time.

– Introducing the word “coolie”: Adding this slur contributes to placing the story in the more distant past. We wouldn’t refer to a modern worker as a coolie, partly because it’s rude but also because it’s an old-fashioned word to use, even descriptively.

– Adding the physical description, “burly”: This is the kind of detail that can make a story seem more real; also, “burly” is another old-fashioned word, again placing the story back in the mists of time.

– Changing from riverboat pullers to oarsmen: A boat being rowed is more familiar than a boat being pulled, so if you’re telling from memory a story that you’ve never fully visualized, you might unthinkingly make that change.

But three of the elaborations are particularly striking because they don’t just elaborate the story, they also make it more compatible with the usual ideology of academic economists:

– Changing from “the collaborators actually agreed to the hiring of a monitor to whip them” to “the boat was jointly owned by the oarsmen, and that they hired the man responsible for flogging”: In this new version, the workers actually own the boat. As independent agents—owners of capital, in fact!—these workers are now unambiguously hiring the whipper out of their own free will, not acting out of some desperate economic necessity.

– Adding the “titled English woman”: At first this elaboration might seem the most puzzling, as it transforms a Chinese refugee into an upper-class foreigner, introducing an entirely new element to the story. From the economists’ perspective, though, it’s perfect, as this upper-class twit is a perfect foil to the down-to-earth economists who understand the real world (as here and here, for example).

– Adding the bit where the woman “complained to her host of the cruelty to the oarsmen”: This bit helps to make the woman unsympathetic (she “complains”; it is her host who gets to “explain”) and also reinforces the idea of the economists as taboo-busters who can marshal the cold facts to support apparent “cruelty.”

Whether or not the story is “central to the core point of the paper,” it does seem central to a certain way that economists often present themselves, and I do think some reflection on their part is in order.

What should the correction notice have said?

As discussed earlier, I was unsatisfied by the original correction notice (“this incorrect quote can be omitted from the paper without any impact on the substance of the paper”), both because “incorrect quote” doesn’t begin to describe the many ways that the story from Cheung (1983, 8) was changed, and because . . . what does it mean that they included this story that had no impact on the substance? That’s not usually done in academic articles, is it? Even if a story is merely illustrative, the fact that it presumably actually happened is relevant, no? To put it another way, if the best story you can use to illustrate a point is a made-up story, that in itself should be telling you something.

Actually, the quote is not inaccurate at all! It’s a direct quote from one of the practice exam questions in Jensen et al. (1998). The inaccuracy was in the attribution to Cheung and, indirectly, in repeating a highly-distorted version of a story without noting the distortion.

In any case, I think the inclusion of this ridiculous story in the published article is informative. Not directly informative on the economic theory described in the paper, but indirectly informative, in that this fake-o story about the “titled English woman” has spread widely among economists—indeed, so widely that the authors write, “the incorrect quote [sic] also appears in other earlier sources,” and so widely that they didn’t even think to check it, they just blindly attributed it to Cheung (1983, 8). A story so well known it didn’t need to be checked.

Now that’s interesting to me—that this elaborate story, originally based on someone’s childhood memory and then transposed to a different part of China, in a different century, with an entirely different cast of characters, became common currency in some academic circles.

It’s interesting what people will believe without questioning, if it fits with their model of the world. In this case, hard-nosed economists, self-employed laborers, boat rowing on the Yangtze, and, as a foil, a soft-hearted upper-class reformer who doesn’t understand the real world.

So here’s what I think the correction notice should have said:

In our paper, we attributed to Cheung (1983, 8) a story that actually appeared in Jensen et al. (1998). The Cheung (1983) story is:

My own favorite example is riverboat pulling in China before the communist regime, when a large group of workers marched along the shore towing a good-sized wooden boat. The unique interest of this example is that the collaborators actually agreed to the hiring of a monitor to whip them.

Further background is supplied by Cheung (2018):

In 1970, Toronto’s John McManus was my guest in Seattle. I chatted to him about what happened when I was a refugee in wartime Guangxi. The journey from Liuzhou to Guiping was by river, and there were men on the banks whose job was to drag the boat with ropes. There was also an overseer armed with a whip. According to my mother, the whipper was hired to do just that by the boatmen!

Here is the story as printed in Jensen et al. (1998), which is a collection of practice exam questions:

On a boat trip up China’s Yangtze River in the 19th Century, a titled English woman complained to her host of the cruelty to the oarsmen. One burly coolie stood over the rowers with a whip, making sure there were no laggards. Her host explained that the boat was jointly owned by the oarsmen, and that they hired the man responsible for flogging. (Source: Steven Chung [sic].) Explain why such an organizational arrangement would arise voluntarily.

This differs from Cheung’s original version in several minor ways (moving the location from Guangxi to the Yangtze river, changing the time from the twentieth to the nineteenth century, changing from boat pulling to rowing, and adding the colorful phrase “burly coolie”) and in a few major ways (stating that the laborers were owners of the boat, which was not in the original story; changing the female character from a Chinese refugee to a “titled English woman”; having the woman “complain”; and adding a new character whose job is to explain the situation to her).

This is a story some economists like to tell. Our retelling of this story that so neatly fits our theoretical model, without reflection on the story’s fictional nature, perhaps reflects an excess of faith on our part, and suggests we should be careful when trying to apply this model to the real world.

That would do it. Or, if such a correction would be too long, here’s something shorter, to the point, and without defensiveness:

In our paper, we attributed to Cheung (1983, 8) a story that actually appeared in Jensen et al. (1998), which is a collection of practice exam questions. The story as related by Jensen et al. and copied by us is a much distorted version of Cheung (1983), which according to Cheung (2018) derives from a memory of a story told to him as a child.

They could leave it to the readers to decide whether this error affected the substance of the paper.

OK, I was wrong about Paul Samuelson.

Regarding my post from the other day, someone who knows much more about Samuelson than I do provides some helpful background:

It is your emphasis on “realistic” that is wrong. Paul played a significant role in the random walk model for stock prices and knew a huge amount about both theory and practice, as part of the MIT group, including others such as Bob Solow. He had no shades on his eyes – but he knew that the model fit well enough that betting on indexes was for most people as good as any strategy. But how to convey this in an introductory text? Most people would go to simple random walks – coin tossing. But that was far from realistic. Stock prices jumped irregularly and were at that time limited to something like half shares or quarter shares. It is a clever idea to think of a sort of random draw of actual price changes as a device to teach students what was going on. Much more realistic. I cannot believe he ever said this was exactly how it worked. Text books have to have a degree of idealization if they are to make a complex subject understandable. Paul was certainly not a pompous fool. Economics was a no holds barred field in those days and all models were actively criticized from both theoretical and empirical sides. His clever instructional idea was indeed more realistic and effective as well. He did get the Soviet economy wrong, but so did every other economist.

Regarding that last point, I still maintain that Samuelson’s problem was not getting the Soviet economy wrong, but rather that he didn’t wrestle with his error in later editions of his book; see third and fourth paragraph of this comment. But, sure, that’s just one thing. Overall I get my correspondent’s point, and I no longer think the main substance of my earlier post was correct.

Maybe Paul Samuelson and his coauthors should’ve spent less time on dominance games and “boss moves” and more time actually looking out at the world that they were purportedly describing.

Yesterday we pointed to a post by Gary Smith, “Don’t worship math: Numbers don’t equal insight,” subtitled, “The unwarranted assumption that investing in stocks is like rolling dice has led to some erroneous conclusions and extraordinarily conservative advice,” that included a wonderful story that makes the legendary economist Paul Samuelson look like a pompous fool. Here’s Smith:

Mathematical convenience has often trumped common sense in financial models. For example, it is often assumed — because the assumption is useful — that changes in stock prices can be modeled as independent draws from a probability distribution. Paul Samuelson offered this analogy:

Write down those 1,800 percentage changes in monthly stock prices on as many slips of paper. Put them in a big hat. Shake vigorously. Then draw at random a new couple of thousand tickets, each time replacing the last draw and shaking vigorously. That way we can generate new realistically representative possible histories of future equity markets.

I [Smith] did Samuelson’s experiment. I put 100 years of monthly returns for the S&P 500 in a computer “hat” and had the computer randomly select monthly returns (with replacement) until I had a possible 25-year history. I repeated the experiment one million times, giving one million “Samuelson simulations.”

I also looked at every possible starting month in the historical data and determined the very worst and very best actual 25-year investment periods. The worst period began in September 1929, at the start of the Great Crash. An investment over the next 25 years would have had an annual return of 5.1%. The best possible starting month was January 1975, after the 1973-1974 crash. The annual rate of return over the next 25 years was 17.3%.

In the one million Samuelson simulations, 9.6% of the simulations gave 25-year returns that were worse than any 25-year period in the historical data and 4.9% of the simulations gave 25-year returns that were better than any actual 25-year historical period. Overall, 14.5% of the Samuelson simulations gave 25-year returns that were too extreme. Over a 50-year horizon, 24.5% of the Samuelson simulations gave 50-year returns that were more extreme than anything that has ever been experienced.
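Here, for the curious, is a minimal sketch of that resampling exercise—not Smith’s actual code, and the commented-out file name is a placeholder—assuming you have an array of historical monthly returns to draw from:

```python
# Samuelson's "slips of paper in a hat": draw monthly returns with replacement
# and compare simulated long-horizon outcomes to the actual historical record.
import numpy as np

def samuelson_simulations(monthly_returns, horizon_years=25, n_sims=100_000, seed=0):
    """Annualized return of each simulated horizon (Smith ran one million)."""
    rng = np.random.default_rng(seed)
    n_months = 12 * horizon_years
    draws = rng.choice(np.asarray(monthly_returns), size=(n_sims, n_months), replace=True)
    growth = np.prod(1.0 + draws, axis=1)            # cumulative growth factor
    return growth ** (1.0 / horizon_years) - 1.0     # annualized return

def historical_extremes(monthly_returns, horizon_years=25):
    """Worst and best annualized return over every actual overlapping horizon."""
    r = np.asarray(monthly_returns)
    n_months = 12 * horizon_years
    growth = np.array([np.prod(1.0 + r[i:i + n_months])
                       for i in range(len(r) - n_months + 1)])
    annualized = growth ** (1.0 / horizon_years) - 1.0
    return annualized.min(), annualized.max()

# Hypothetical usage ("sp500_monthly_returns.csv" is a placeholder file name):
# returns = np.loadtxt("sp500_monthly_returns.csv")
# sims = samuelson_simulations(returns)
# worst, best = historical_extremes(returns)
# too_extreme = np.mean((sims < worst) | (sims > best))   # Smith reports 14.5%
```

Comparing the simulated annualized returns against the actual historical extremes is the calculation behind Smith’s 14.5% figure.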

You might say that Smith is being unfair, as Samuelson was only offering a simple mathematical model. But it was Samuelson, not Smith, who characterized his random drawing as “realistically representative possible histories of future equity markets.” Samuelson was the one claiming realism.

My take is that Samuelson wanted it both ways. He wanted to show off his math, but he also wanted relevance, hence his “realistically.”

The prestige of economics comes partly from its mathematical sophistication but mostly because it’s supposed to relate to the real world.

Smith’s example of Samuelson’s error reminded me of this story from David Levy and Sandra Peart about this graph from the legendary textbook. This is from 1961:

[samuelson.png: the graph from the 1961 edition of Samuelson’s textbook projecting when the Soviet economy might overtake the U.S. economy]

Alex Tabarrok pointed out that it’s even worse than it looks: “in subsequent editions Samuelson presented the same analysis again and again except the overtaking time was always pushed further into the future so by 1980 the dates were 2002 to 2012. In subsequent editions, Samuelson provided no acknowledgment of his past failure to predict and little commentary beyond remarks about ‘bad weather’ in the Soviet Union.”

The bit about the bad weather is funny. If you’ve had bad weather in the past, maybe the possibility of future bad weather should be incorporated into the forecast, no?

Is there a connection?

Can we connect Samuelson’s two errors?

Again, the error with the Soviet economy forecast is not that he was wrong in the frenzied post-Sputnik year of 1961; the problem is that he kept making this error in his textbook for decades to come. Here’s another bit, from Larry White:

As late as the 1989 edition [Samuelson] coauthor William Nordhaus wrote: ‘The Soviet economy is proof that, contrary to what many skeptics had earlier believed, a socialist command economy can function and even thrive.’

I see three similarities between the stock-market error and the command-economy error:

1. Love of simple mathematical models: the random walk in one case and straight trends in the other. The model’s so pretty, it’s too good to check.

2. Disregard of data. Smith did that experiment disproving Samuelson’s claim. Samuelson could’ve done that experiment himself! But he didn’t. That didn’t stop him from making a confident claim about it. As for the Soviet Union, by the time 1980 had come along Samuelson had 20 years of data refuting his original model, but that didn’t stop him from just shifting the damn curve. No sense that, hey, maybe the model has a problem!

3. Technocratic hubris. There’s this whole story about how Samuelson was so brilliant. I have no idea how brilliant he was—maybe standards were lower back then?—but math and reality don’t care how brilliant you are. I see a connection between Samuelson thinking that he could describe the stock market with a simple random walk model, and him thinking that the Soviets could just pull some levers and run a thriving economy. Put the experts in charge, what could go wrong, huh?

More stories

Smith writes:

As a student, Samuelson reportedly terrorized his professors with his withering criticisms.

Samuelson is of course the uncle of Larry Summers, another never-admit-a-mistake guy. There is a story about Summers saying something stupid to Samuelson a week before Arthur Okun’s funeral. Samuelson reportedly said to Summers, “In my eulogy for Okun, I’m going to say that I don’t remember him ever saying anything stupid. Well, now I won’t be able to say that about you.”

There was a famous feud between Samuelson and Harry Markowitz about whether investors should think about arithmetic or geometric means. In one Samuelson paper responding to Markowitz, every word (other than the author names) was one syllable.

I once gave a paper at a festschrift honoring Tobin. Markowitz began his talk by graciously saying to Samuelson, who was sitting arm-crossed in the front row, “In the spirit of this joyous occasion, I would like to say to Paul that ‘Perhaps there is some merit in your argument.’” Samuelson immediately responded, “I wish I could say the same.”

Here’s the words-of-one-syllable paper, and here’s a post that Smith found:

Maybe Samuelson and his coauthors should’ve spent less time on dominance games and “boss moves” and more time actually looking out at the world that they were purportedly describing.

P.S. OK, I was wrong.

The mistake comes when it is elevated from a heuristic to a principle.

Gary Smith pointed me to this post, “Don’t worship math: Numbers don’t equal insight,” subtitled, “The unwarranted assumption that investing in stocks is like rolling dice has led to some erroneous conclusions and extraordinarily conservative advice,” which reminded me of my discussion with Nate Silver a few years ago regarding his mistaken claim that, “the most robust assumption is usually that polling is essentially a random walk, i.e., that the polls are about equally likely to move toward one or another candidate, regardless of which way they have moved in the past.” My post was called, “Politics is not a random walk: Momentum and mean reversion in polling,” and David Park, Noah Kaplan, and I later expanded that into a paper, “Understanding persuasion and activation in presidential campaigns: The random walk and mean-reversion models.”

The random walk model for polls is a bit like the idea that the hot hand is a fallacy: it’s an appealing argument that has a lot of truth to it (as compared to the alternative model that poll movement or sports performance is easily predictable given past data) but is not quite correct, and the mistake comes when it is elevated from a heuristic to a principle.
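To make the distinction concrete, here’s a minimal sketch contrasting the two models, with made-up parameter values rather than anything estimated from polling data:

```python
# Random walk vs. mean reversion in a simulated poll series (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
T, mu, rho, sigma = 200, 50.0, 0.95, 0.5   # days, long-run support, pull toward mu, daily noise

# Random walk: shocks accumulate forever, so the series can wander anywhere.
random_walk = mu + np.cumsum(rng.normal(0.0, sigma, T))

# Mean reversion (AR(1)): each day's support is pulled back toward mu.
mean_reverting = np.empty(T)
mean_reverting[0] = mu
for t in range(1, T):
    mean_reverting[t] = mu + rho * (mean_reverting[t - 1] - mu) + rng.normal(0.0, sigma)

# Under the random walk, the best forecast for any future day is today's poll;
# under mean reversion, the forecast shrinks back toward mu as the horizon grows.
```

Over short horizons the two models behave almost identically, which is why the random-walk heuristic works as well as it does; the trouble comes when it’s treated as a principle and extended to long horizons.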

This mistake happens a lot, no? It comes up in statistics all the time.

P.S. Some discussion in comments on stock market and investing. I know nothing about that topic; the above post is just about the general problem of people elevating a heuristic to a principle.

“The market can become rational faster than you can get out”

Palko pointed me to one of these stories about a fraudulent online business that crashed and burned. I replied that it sounded a lot like Theranos. The conversation continued:

Palko: Sounds like all the unicorns. The venture capital model breeds these things.

Me: Unicorns aren’t real, right?

Palko: Unicorns are mythical beasts and those who invest in them are boobies.

Me: Something something longer than something something stay solvent.

Palko: That’s good advice for short sellers, but it’s good to remember the corollary: the market can become rational faster than you can get out.

Good point.