Their signal-to-noise ratio was low, so they decided to do a specification search, use a one-tailed test, and go with a p-value of 0.1.

Adam Zelizer writes:

I saw your post about the underpowered COVID survey experiment on the blog and wondered if you’ve seen this paper, “Counter-stereotypical Messaging and Partisan Cues: Moving the Needle on Vaccines in a Polarized U.S.” It is written by a strong team of economists and political scientists and finds large positive effects of Trump pro-vaccine messaging on vaccine uptake.

They find large positive effects of the messaging (administered through Youtube ads) on the number of vaccines administered at the county level—over 100 new vaccinations in treated counties—but only after changing their specification from the prespecified one in the PAP. The p-value from the main modified specification is only 0.097, from a one-tailed test, and the effect size from the modified specification is 10 times larger than what they get from the pre-specified model. The prespecified model finds that showing the Trump advertisement increased the number of vaccines administered in the average treated county by 10; the specification in the paper, and reported in the abstract, estimates 103 more vaccines. So moving from the specification in the PAP to the one in the paper doesn’t just improve precision, but it dramatically increases the estimated treatment effect. A good example of suppression effects.

They explain their logic for using the modified specification, but it smells like the garden of forking paths.

Here’s a snippet from the article:

I don’t have much to say about the forking paths except to give my usual advice to fit all reasonable specifications and use a hierarchical model, or at the very least do a multiverse analysis. No reason to think that the effect of this treatment should be zero, and if you really care about effect size you want to avoid obvious sources of bias such as model selection.

The above bit about one-tailed tests reflects a common misunderstanding in social science. As I’ll keep saying until my lips bleed, effects are never zero. They’re large in some settings, small in others, sometimes positive, sometimes negative. From the perspective of the researchers, the idea of the hypothesis test is to give convincing evidence that the treatment truly has a positive average effect. That’s fine, and it’s addressed directly through estimation: the uncertainty interval gives you a sense of what the data can tell you here.

When they say they’re doing a one-tailed test and they’re cool with a p-value of 0.1 (that would be 0.2 when following the standard approach) because they have “low signal-to-noise ratios” . . . that’s just wack. Low signal-to-noise ratio implies high uncertainty in your conclusions. High uncertainty is fine! You can still recommend this policy be done in the midst of this uncertainty. After all, policymakers have to do something. To me, this one-sided testing and p-value thresholding thing just seems to be missing the point, in that it’s trying to squeeze out an expression of near-certainty from data that don’t admit such an interpretation.

P.S. I do not write this sort of post out of any sort of animosity toward the authors or toward their topic of research. I write about these methods issues because I care. Policy is important. I don’t think it is good for policy for researchers to use statistical methods that lead to overconfidence and inappropriate impressions of certainty or near-certainty. The goal of a statistical analysis should not be to attain statistical significance or to otherwise reach some sort of success point. It should be to learn what we can from our data and model, and also to get a sense of what we don’t know.

Fully funded doctoral student positions in Finland

There is a new government-funded Finnish Doctoral Program in AI. Research topics include Bayesian inference, modeling, and workflows as part of fundamental AI. There is a big joint call, where you can choose the supervisor you want to work with. I (Aki) am also one of the supervisors. Come work with me or share the news! The first call deadline is April 2, and the second call deadline is in fall 2024. See how to apply at https://fcai.fi/doctoral-program, and more about my research at my web page.

Zotero now features retraction notices

David Singerman writes:

Like a lot of other humanities and social sciences people I use Zotero to keep track of citations, create bibliographies, and even take & store notes. I also am not alone in using it in teaching, making it a required tool for undergraduates in my classes so they learn to think about organizing their information early on. And it has sharing features too, so classes can create group bibliographies that they can keep using after the semester ends.

Anyway my desktop client for Zotero updated itself today and when it relaunched I had a big red banner informing me that an article in my library had been retracted! I didn’t recognize it at first, but eventually realized that was because it was an article one of my students had added to their group library for a project.

The developers did a good job of making the alert unmissable (i.e. not like a corrections notice in a journal), the full item page contains lots of information and helpful links about the retraction, and there’s a big red X next to the listing in my library. See attached screenshots.

The way they implemented it will also help the teaching component, since a student will get this alert too.

Singerman adds this P.S.:

This has reminded me that some time ago you posted something about David Byrne, and whatever you said, it made me think of David Byrne’s wonderful appearance on the Colbert Report.

What was amazing to me when I saw it was that it’s kind of like a battle between Byrne’s inherent weirdness and sincerity, and Colbert’s satirical right-wing bloviator character. Usually Colbert’s character was strong enough to defeat all comers, but . . . decide for yourself.

Putting a price on vaccine hesitancy (Bayesian analysis of a conjoint experiment)

Tom Vladeck writes:

I thought you may be interested in some internal research my company did using a conjoint experiment, with analysis using Stan! The upshot is that we found that vaccine hesitant people would require a large payment to take the vaccine, and that there was a substantial difference between the prices required for J&J and Moderna & Pfizer (evidence that the pause was very damaging). You can see the model code here.

My reply: Cool! I recommend you remove the blank lines from your Stan code as that will make your program easier to read.

Vladeck responded:

I prefer a lot of vertical white space. But good to know that I’m likely in the minority there.

For me, it’s all about the real estate. White space can help code be more readable but it should be used sparingly. What I’d really like is a code editor that does half white spaces.

Refuted papers continue to be cited more than their failed replications: Can a new search engine be built that will fix this problem?

Paul von Hippel writes:

Stuart Buck noticed your recent post on A WestLaw for Science. This is something that Stuart and I started talking about last year, and Stuart, who trained as an attorney, believes it was first suggested by a law professor about 15 years ago.

Since the 19th century, the legal profession has had citation indices that do far more than count citations and match keywords. Resources like Shepard’s Citations—first printed in 1873 and now published online along with competing tools such as JustCite, KeyCite, BCite, and SmartCite—do not just find relevant cases and statutes; they show lawyers whether a case or statute is still “good law.” Legal citation indexes show lawyers which cases have been affirmed or cited approvingly, and which have been criticized, reversed, or overruled by later courts.

Although Shepard’s Citations inspired the first Science Citation Index in 1960, which in turn inspired tools like Google Scholar, today’s academic search engines still rely primarily on citation counts and keywords. As a result, many scientists are like lawyers who walk into the courtroom unaware that a case central to their argument has been overruled.

Kind of, but not quite. A key difference is that in the courtroom there is some reasonable chance that the opposing lawyer or the judge will notice that the key case has been overruled, so that your argument that hinges on that case will fail. You have a clear incentive to not rely on overruled cases. In science, however, there’s no opposing lawyer and no judge: you can build an entire career on studies that fail to replicate, and no problem at all, as long as you don’t pull any really ridiculous stunts.

Von Hippel continues:

Let me share a couple of relevant articles that we recently published.

One, titled “Is Psychological Science Self-Correcting?” reports that replication studies, whether successful or unsuccessful, rarely have much effect on citations to the studies being replicated. When a finding fails to replicate, most influential studies sail on, continuing to gather citations at a similar rate for years, as though the replication had never been tried. The issue is not limited to psychology and raises serious questions about how quickly the scientific community corrects itself, and whether replication studies are having the correcting influence that we would like them to have. I considered several possible reasons for the persistent influence of studies that failed to replicate, and concluded that academic search engines like Google Scholar may well be part of the problem, since they prioritize highly cited articles, replicable or not, perpetuating the influence of questionable findings.

The finding that replications don’t affect citations has itself replicated pretty well. A recent blog post by Bob Reed at the University of Canterbury, New Zealand, summarized five recent papers that showed more or less the same thing in psychology, economics, and Nature/Science publications.

In a second article, published just last week in Nature Human Behaviour, Stuart Buck and I suggest ways to “Improve academic search engines to reduce scholars’ biases.” We suggest that the next generation of academic search engines should do more than count citations; they should help scholars assess studies’ rigor and reliability. We also suggest that future engines should be transparent, responsive, and open source.

This seems like a reasonable proposal. The good news is that it’s not necessary for their hypothetical new search engine to dominate or replace existing products. People can use Google Scholar to find the most cited papers and use this new thing to inform about rigor and reliability. A nudge in the right direction, you might say.

A new piranha paper

Kris Hardies points to this new article, Impossible Hypotheses and Effect-Size Limits, by Wijnand and Lennert van Tilburg, which states:

There are mathematical limits to the magnitudes that population effect sizes can take within the common multivariate context in which psychology is situated, and these limits can be far more restrictive than typically assumed. The implication is that some hypothesized or preregistered effect sizes may be impossible. At the same time, these restrictions offer a way of statistically triangulating the plausible range of unknown effect sizes.

This is closely related to our Piranha Principle, which we first formulated here and then followed up with this paper. It’s great to see more work being done in this area.

Statistical practice as scientific exploration

This was originally going to happen today, 8 Mar 2024, but it got postponed to some unspecified future date; I don’t know why. In the meantime, here’s the title and abstract:

Statistical practice as scientific exploration

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

Much has been written on the philosophy of statistics: How can noisy data, mediated by probabilistic models, inform our understanding of the world? After a brief review of that topic (in short, I am a Bayesian but not an inductivist), I discuss the ways in which researchers when using and developing statistical methods are acting as scientists, forming, evaluating, and elaborating provisional theories about the data and processes they are modeling. This perspective has the conceptual value of pointing toward ways that statistical theory can be expanded to incorporate aspects of workflow that were formerly tacit or informal aspects of good practice, and the practical value of motivating tools for improved statistical workflow, as described in part in this article: http://www.stat.columbia.edu/~gelman/research/unpublished/Bayesian_Workflow_article.pdf

The whole thing is kind of mysterious to me. In the email invitation it was called the UPenn Philosophy of Computation and Data Workshop, but then they sent me a flyer where it was called the Philosophy of A.I., Data Science, & Society Workshop in the Quantitative Theory and Methods Department at Emory University. It was going to be on zoom so I guess the particular university affiliation didn’t matter.

In any case, the topic is important, and I’m always interested in speaking with people on the philosophy of statistics. So I hope they get around to rescheduling this one.

Relating t-statistics and the relative width of confidence intervals

How much does a statistically significant estimate tell us quantitatively? If you have an estimate that’s statistically distinguishable from zero with some t-statistic, what does that say about your confidence interval?

Perhaps most simply, with a t-statistic of 2, your 95% confidence intervals will nearly touch 0. That is, they’re just about 100% wide in each direction. So they cover everything from nothing (0%) to around double your estimate (200%).

More generally, for a 95% confidence interval (CI), 1.96/t — or let’s say 2/t — gives the relative half-width of the CI. So for an estimate with t=4, everything from half your estimate to 150% of your estimate is in the 95% CI.

For other commonly-used nominal coverage rates, the confidence intervals have a width that is less conducive to a rule of thumb, since the critical value isn’t something nice like ~2. (For example, with 99% CIs, the Gaussian critical value is 2.58.) Let’s look at 90, 95, and 99% confidence intervals for t = 1.96, 3, 4, 5, and 6:

[Figure: Confidence intervals on the scale of the estimate]
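
Here’s a quick R sketch of these relative intervals, assuming standard normal critical values (a check of the numbers, not a reproduction of the figure):

rel_ci <- expand.grid(t = c(1.96, 3, 4, 5, 6), coverage = c(0.90, 0.95, 0.99))
rel_ci$z <- qnorm(1 - (1 - rel_ci$coverage) / 2)   # critical value, e.g. 1.96 for 95%
rel_ci$lower <- 1 - rel_ci$z / rel_ci$t            # as a multiple of the point estimate
rel_ci$upper <- 1 + rel_ci$z / rel_ci$t
round(rel_ci, 2)
# For example, at t = 5 the 99% interval runs from about 0.48 to 1.52 times
# the estimate, so the halved point estimate is still (barely) inside.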

You can see, for example, that even at t=5, the halved point estimate is still inside the 99% CI. Perhaps this helpfully highlights how much more precision you need to confidently state the size of an effect than just to reject the null.

These “relative” confidence intervals are just this smooth function of t (and thus the p-value), as displayed here:

[Figure: Confidence intervals on the scale of the estimate, by p-value and t-statistic]

It is only when the statistical evidence against the null is overwhelming — “six sigma” overwhelming or more — that you’re also getting tight confidence intervals in relative terms. Among other things, this highlights that if you need to use your estimates quantitatively, rather than just to reject the null, default power analysis is going to be overoptimistic.

A caveat: All of this just considers standard confidence intervals based on normal theory labeled by their nominal coverage. Of course, many p < 0.05 estimates may have been arrived at by wandering through a garden of forking paths, or precisely because it passed a statistical significance filter. Then these CIs are not going to conditionally have their advertised coverage.

With journals, it’s all about the wedding, never about the marriage.

John “not Jaws” Williams writes:

Here is another example of how hard it is to get erroneous publications corrected, this time from the climatology literature, and how poorly peer review can work.

From the linked article, by Gavin Schmidt:

Back in March 2022, Nicola Scafetta published a short paper in Geophysical Research Letters (GRL) . . . We (me, Gareth Jones and John Kennedy) wrote a note up within a couple of days pointing out how wrongheaded the reasoning was and how the results did not stand up to scrutiny. . . .

After some back and forth on how exactly this would work (including updating the GRL website to accept comments), we reformatted our note as a comment, and submitted it formally on December 12, 2022. We were assured from the editor-in-chief and publications manager that this would be a ‘streamlined’ and ‘timely’ review process. With respect to our comment, that appeared to be the case: It was reviewed, received minor comments, was resubmitted, and accepted on January 28, 2023. But there it sat for 7 months! . . .

The issue was that the GRL editors wanted to have both the comment and a reply appear together. However, the reply had to pass peer review as well, and that seems to have been a bit of a bottleneck. But while the reply wasn’t being accepted, our comment sat in limbo. Indeed, the situation inadvertently gives the criticized author(s) an effective delaying tactic since, as long as a reply is promised but not delivered, the comment doesn’t see the light of day. . . .

All in all, it took 17 months, two separate processes, and dozens of emails, who knows how much internal deliberation, for an official comment to get into the journal pointing out issues that were obvious immediately the paper came out. . . .

The odd thing about how long this has taken is that the substance of the comment was produced extremely quickly (a few days) because the errors in the original paper were both commonplace and easily demonstrated. The time, instead, has been entirely taken up by the process itself. . . .

Schmidt also asks a good question:

Why bother? . . . Why do we need to correct the scientific record in formal ways when we have abundant blogs, PubPeer, and social media, to get the message out?

His answer:

Since journals remain extremely reluctant to point to third party commentary on their published papers, going through the journals’ own process seems like it’s the only way to get a comment or criticism noticed by the people who are reading the original article.

Good point. I’m glad that there are people like Schmidt and his collaborators who go to the trouble to correct the public record. I do this from time to time, but mostly I don’t like the stress of dealing with the journals so I’ll just post things here.

My reaction

This story did not surprise me. I’ve heard it a million times, and it’s often happened to me, which is why I once wrote an article called It’s too hard to publish criticisms and obtain data for replication.

Journal editors mostly hate to go back and revise anything. They’re doing volunteer work, and they’re usually in it because they want to publish new and exciting work. Replications, corrections, etc., that’s all seen as boooooring.

With journals, it’s all about the wedding, never about the marriage.

My NYU econ talk will be Thurs 18 Apr 12:30pm (NOT Thurs 7 Mar)

Hi all. The other day I announced a talk I’ll be giving at the NYU economics seminar. It will be Thurs 18 Apr 12:30pm at 19 West 4th St., room 517.

In my earlier post, I’d given the wrong day for the talk. I’d written that it was this Thurs, 7 Mar. That was wrong! Completely my fault here; I misread my own calendar.

So I hope nobody shows up to that room tomorrow! Thank you for your forbearance.

I hope to see youall on Thurs 18 Apr. Again, here’s the title and abstract:

How large is that treatment effect, really?

“Unbiased estimates” aren’t really unbiased, for a bunch of reasons, including aggregation, selection, extrapolation, and variation over time. Econometrics typically focuses on causal identification, with the goal of estimating “the” effect. But we typically care about individual effects (not “Does the treatment work?” but “Where and when does it work?” and “Where and when does it hurt?”). Estimating individual effects is relevant not only for individuals but also for generalizing to the population. For example, how do you generalize from an A/B test performed on a sample right now to possible effects on a different population in the future? Thinking about variation and generalization can change how we design and analyze experiments and observational studies. We demonstrate with examples in social science and public health.

Defining optimal reliance on model predictions in AI-assisted decisions

This is Jessica. In a previous post I mentioned methodological problems with studies of AI-assisted decision-making, such as those used to evaluate different model explanation strategies. The typical study set-up gives people some decision task (e.g., Given the features of this defendant, decide whether to convict or release), has them make their decision, then gives them access to a model’s prediction, and observes if they change their mind. Studying this kind of AI-assisted decision task is of interest as organizations deploy predictive models to assist human decision-making in domains like medicine and criminal justice. Ideally, the human is able to use the model to improve on the performance they’d get on their own or that the model would get if deployed without a human in the loop (referred to as complementarity).

The most frequently used definition of appropriate reliance is that if the person goes with the model prediction but it’s wrong, this is overreliance. If they don’t go with the model prediction but it’s right, this is labeled underreliance. Otherwise it is labeled appropriate reliance. 

This definition is problematic for several reasons. One is that the AI might have a higher probability than the human of selecting the right action, but still end up being wrong. It doesn’t make sense to say the human made the wrong choice by following it in such cases. Because it’s based on post-hoc correctness, this approach confounds two sources of non-optimal human behavior: not accurately estimating the probability that the AI is correct versus not making the right choice of whether to go with the AI or not given one’s beliefs. 

By scoring decisions in action space, it also equally penalizes not choosing the right action (which prediction to go with) in a scenario where the human and the AI have very similar probabilities of being correct and one where either the AI or human has a much higher probability of being correct. Nevertheless, there are many papers doing it this way, some with hundreds of citations. 

In A Statistical Framework for Measuring AI Reliance, Ziyang Guo, Yifan Wu, Jason Hartline and I write: 

Humans frequently make decisions with the aid of artificially intelligent (AI) systems. A common pattern is for the AI to recommend an action to the human who retains control over the final decision. Researchers have identified ensuring that a human has appropriate reliance on an AI as a critical component of achieving complementary performance. We argue that the current definition of appropriate reliance used in such research lacks formal statistical grounding and can lead to contradictions. We propose a formal definition of reliance, based on statistical decision theory, which separates the concepts of reliance as the probability the decision-maker follows the AI’s prediction from challenges a human may face in differentiating the signals and forming accurate beliefs about the situation. Our definition gives rise to a framework that can be used to guide the design and interpretation of studies on human-AI complementarity and reliance. Using recent AI-advised decision making studies from literature, we demonstrate how our framework can be used to separate the loss due to mis-reliance from the loss due to not accurately differentiating the signals. We evaluate these losses by comparing to a baseline and a benchmark for complementary performance defined by the expected payoff achieved by a rational agent facing the same decision task as the behavioral agents.

It’s a similar approach to our rational agent framework for data visualization, but here we assume a setup in which the decision-maker receives a signal consisting of the feature values for some instance, the AI’s prediction, the human’s prediction, and optionally some explanation of the AI decision. The decision-maker chooses which prediction to go with. 

We can compute the upper bound or best attainable performance in such a study (rational benchmark) as the expected score of a rational decision-maker on a randomly drawn decision task. The rational decision-maker has prior knowledge of the data generating model (the joint distribution over the signal and ground truth state). Seeing the instance in a decision trial, they accurately perceive the signal, arrive at Bayesian posterior beliefs about the distribution of the payoff-relevant state, then choose the action that maximizes their expected utility over the posterior. We calculate this in payoff space as defined by the scoring rule, such that the cost of an error can vary in magnitude. 

We can define the “value of rational complementation” for the decision-problem at hand by also defining the rational agent baseline: the expected performance of the rational decision-maker without access to the signal on a randomly chosen decision task from the experiment. Because it represents the score the rational agent would get if they could rely only on their prior beliefs about the data-generating model, the baseline is the expected score of a fixed strategy that always chooses the better of the human alone or the AI alone.

[Figure: human alone, then AI alone, then (with ample room between them) the rational benchmark]

If designing or interpreting an experiment on AI reliance, the first thing we might want to do is look at how close to the benchmark the baseline is. We want to see a decent amount of room for the human-AI team to improve performance over the baseline, as in the image above. If the baseline is very close to the benchmark, it’s probably not worth adding the human.

Once we have run an experiment and observed how well people make these decisions, we can treat the value of complementation as a comparative unit for interpreting how much value adding the human contributes over making the decision with the baseline. We do this by normalizing the observed score within the range where the rational agent baseline is 0 and the rational agent benchmark is 1 and looking at where the observed human+AI performance lies. This also provides a useful sense of effect size when we are comparing different settings. For example, if we have two model explanation strategies A and B we compared in an experiment, we can calculate expected human performance on a randomly drawn decision trial under A and under B and measure the improvement by calculating (score_A − score_B)/value of complementation. 
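
A small sketch of this normalization in R, with made-up numbers for the baseline, benchmark, and observed scores:

baseline  <- 0.70   # expected score of the better of human-alone or AI-alone
benchmark <- 0.90   # expected score of the rational decision-maker using the signal
value_of_complementation <- benchmark - baseline
score_A <- 0.78     # observed expected score under explanation strategy A
score_B <- 0.74     # observed expected score under explanation strategy B
(score_A - baseline) / value_of_complementation   # 0.4 on the 0 (baseline) to 1 (benchmark) scale
(score_B - baseline) / value_of_complementation   # 0.2 on that scale
(score_A - score_B) / value_of_complementation    # effect size of A vs. B: 0.2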

[Figure: human alone, then AI alone, then human+AI, then the rational benchmark]

We can also decompose sources of error in study participants’ performance. To do this, we define a “mis-reliant” rational decision-maker benchmark, which is the expected score of a rational agent constrained to the reliance level that we observe in study participants. Hence this is the best score a decision-maker who relies on the AI the same overall proportion of the time could attain had they perfectly perceived the probability that the AI is correct relative to the probability that the human is correct on every decision task. Since the mis-reliant benchmark and the study participants have the same reliance level (i.e., they both accept the AI’s prediction the same percentage of the time), the difference in their decisions lies entirely in accepting the AI predictions at different instances. The mis-reliant rational decision-maker always accepts the top X% AI predictions ranked by performance advantage over human predictions, but study participants may not.

[Figure: human alone, AI alone, human+AI, mis-reliant rational benchmark, rational benchmark]

By calculating the mis-reliant rational benchmark for the observed reliance level of study participants, we can distinguish between reliance loss, the loss from over- or under-relying on the AI (defined as the difference between the rational benchmark and mis-reliant benchmark divided by the value of rational complementation), and discrimination loss, the loss from not accurately differentiating the instances where the AI is better than the human from the ones where the human is better than the AI (defined as the difference between the mis-reliant benchmark and the expected score of participants divided by the value of rational complementation). 
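
Continuing the made-up numbers from the sketch above, the loss decomposition looks like this:

benchmark <- 0.90                 # rational benchmark
misreliant_benchmark <- 0.85      # best score attainable at participants' observed reliance level
participant_score <- 0.78         # observed expected score of study participants
value_of_complementation <- 0.20  # benchmark minus baseline, as before
(benchmark - misreliant_benchmark) / value_of_complementation   # reliance loss: 0.25
(misreliant_benchmark - participant_score) / value_of_complementation   # discrimination loss: 0.35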

We apply this approach to some well-known studies on AI reliance and extend the original interpretations to varying degrees. In one case there was little potential to see complementarity in the study to begin with, given how much better the AI was than the human going in; in another, the original interpretation missed that participants’ reliance levels were pretty close to rational and they just couldn’t distinguish which signals they should go with the AI on. We also observe researchers making comparisons across conditions for which the upper and lower bounds on performance differ, without accounting for the difference.

There’s more in the paper – for example, we discuss how the rational benchmark, our upper bound representing the expected score of a rational decision-maker on a randomly chosen decision task, may be overfit to the empirical data. This occurs when the signal space is very large (e.g., the instance is a text document) such that we observe very few human predictions per signal. We describe how the rational agent could determine the best response on the optimal coarsening of the empirical distribution, such that the true rational benchmark is bounded by this and the overfit upper bound. 

While we focused on showing how to improve research on human-AI teams, I’m excited about the potential for this framework to help organizations as they consider whether deploying an AI is likely to improve some human decision process. We are currently thinking about what sorts of practical questions (beyond Could pairing a human and AI be effective here?) we can answer using such a framework.

Mindlessness in the interpretation of a study on mindlessness (and why you shouldn’t use the word “whom” in your dating profile)

This is a long post, so let me give you the tl;dr right away: Don’t use the word “whom” in your dating profile.

OK, now for the story. Fasten your seat belts, it’s going to be a bumpy night.

It all started with this message from Dmitri with subject line, “Man I hate to do this to you but …”, which continued:

How could I resist?

https://www.cnbc.com/2024/02/15/using-this-word-can-make-you-more-influential-harvard-study.html

I’m sorry, let me try again … I had to send this to you BECAUSE this is the kind of obvious shit you like to write about. I like how they didn’t even do their own crappy study they just resurrected one from the distant past.

OK, ok, you don’t need to shout about it!

Following the link, we see this breathless CNBC news story:

Using this 1 word more often can make you 50% more influential, says Harvard study

Sometimes, it takes a single word — like “because” — to change someone’s mind.

That’s according to Jonah Berger, a marketing professor at the Wharton School of the University of Pennsylvania who’s compiled a list of “magic words” that can change the way you communicate. Using the word “because” while trying to convince someone to do something has a compelling result, he tells CNBC Make It: More people will listen to you, and do what you want.

Berger points to a nearly 50-year-old study from Harvard University, wherein researchers sat in a university library and waited for someone to use the copy machine. Then, they walked up and asked to cut in front of the unknowing participant.

They phrased their request in three different ways:

“May I use the Xerox machine?”
“May I use the Xerox machine because I have to make copies?”
“May I use the Xerox machine because I’m in a rush?”
Both requests using “because” made the people already making copies more than 50% more likely to comply, researchers found. Even the second phrasing — which could be reinterpreted as “May I step in front of you to do the same exact thing you’re doing?” — was effective, because it indicated that the stranger asking for a favor was at least being considerate about it, the study suggested.

“Persuasion wasn’t driven by the reason itself,” Berger wrote in a book on the topic, “Magic Words,” which published last year. “It was driven by the power of the word.” . . .

Let’s look into this claim. The first thing I did was click to the study—full credit to CNBC Make It for providing the link—and here’s the data summary from the experiment:

If you look carefully and do some simple calculations, you’ll see that the percentage of participants who complied was 37.5% under treatment 1, 50% under treatment 2, and 62.5% under treatment 3. So, ok, not literally true that both requests using “because” made the people already making copies more than 50% more likely to comply: 0.50/0.375 = 1.33, and an increase of 33% is not “more than 50%.” But, sure, it’s a positive result. There were 40 participants in each treatment, so the standard error is approximately 0.5/sqrt(40) = 0.08 for each of those averages. The key difference here is 0.50 – 0.375 = 0.125; that’s the difference between the compliance rates under the treatments “May I use the Xerox machine?” and “May I use the Xerox machine because I have to make copies?”, and this will have a standard error of approximately sqrt(2)*0.08 = 0.11.

The quick summary from this experiment: an observed difference in compliance rates of 12.5 percentage points, with a standard error of 11 percentage points. I don’t want to say “not statistically significant,” so let me just say that the estimate is highly uncertain, so I have no real reason to believe it will replicate.

But wait, you say: the paper was published. Presumably it has a statistically significant p-value somewhere, no? The answer is, yes, they have some “p < .05” results, just not of that particular comparison. Indeed, if you just look at the top rows of that table (Favor = small), then the difference is 0.93 - 0.60 = 0.33 with a standard error of sqrt(0.6*0.4/15 + 0.93*0.07/15) = 0.14, so that particular estimate is just more than two standard errors away from zero. Whew! But now we’re getting into forking paths territory:

- Noisy data
- Small sample
- Lots of possible comparisons
- Any comparison that’s statistically significant will necessarily be huge
- Open-ended theoretical structure that could explain just about any result.

I’m not saying the researchers were trying to do anything wrong. But remember, honesty and transparency are not enuf. Such a study is just too noisy to be useful.
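
If you want to check these numbers, here’s the arithmetic in R (a quick sketch using the percentages and sample sizes quoted in this post):

p <- c(0.375, 0.50, 0.625)                 # overall compliance under the three treatments
n <- 40
se <- sqrt(p * (1 - p) / n)                # roughly 0.08 each
p[2] - p[1]                                # 0.125
sqrt(se[1]^2 + se[2]^2)                    # about 0.11
# The "Favor = small" rows only, with about 15 people per cell:
0.93 - 0.60                                # 0.33
sqrt(0.60 * 0.40 / 15 + 0.93 * 0.07 / 15)  # about 0.14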

But, sure, back in the 1970s many psychology researchers not named Meehl weren’t aware of these issues. They seem to have been under the impression that if you gather some data and find something statistically significant for which you could come up with a good story, then you’d discovered a general truth.

What’s less excusable is a journalist writing this in the year 2024. But it’s no surprise, conditional on the headline, “Using this 1 word more often can make you 50% more influential, says Harvard study.”

But what about that book by the University of Pennsylvania marketing professor? I searched online, and, fortunately for us, the bit about the Xerox machine is right there in the first chapter, in the excerpt we can read for free. Here it is:

He got it wrong, just like the journalist did! It’s not true that including the meaningless reason increased persuasion just as much as the valid reason did. Look at the data! The outcomes under the three treatments were 37.5%, 50%, and 62.5%. 50% – 37.5% ≠ 62.5% – 37.5%. Ummm, ok, he could’ve said something like, “Among a selected subset of the data with only 15 or 16 people in each treatment, including the meaningless reason increased persuasion just as much as the valid reason did.” But that doesn’t sound so impressive! Even if you add something like, “and it’s possible to come up with a plausible theory to go with this result.”

The book continues:

Given the flaws in the description of the copier study, I’m skeptical about these other claims.

But let me say this. If it is indeed true that using the word “whom” in online dating profiles makes you 31% more likely to get a date, then my advice is . . . don’t use the word “whom”! Think of it from a potential-outcomes perspective. Sure, you want to get a date. But do you really want to go on a date with someone who will only go out with you if you use the word “whom”?? That sounds like a really pretentious person, not a fun date at all!

OK, I haven’t read the rest of the book, and it’s possible that somewhere later on the author says something like, “OK, I was exaggerating a bit on page 4 . . .” I doubt it, but I guess it’s possible.

Replications, anyone?

To return to the topic at hand: In 1978 a study was conducted with 120 participants in a single location. The study was memorable enough to be featured in a business book nearly fifty years later.

Surely the finding has been replicated?

I’d imagine yes; on the other hand, if it had been replicated, this would’ve been mentioned in the book, right? So it’s hard to know.

I did a search, and the article does seem to have been influential:

It’s been cited 1514 times—that’s a lot! Google lists 55 citations in 2023 alone, and in what seem to be legit journals: Human Communication Research, Proceedings of the ACM, Journal of Retailing, Journal of Organizational Behavior, Journal of Applied Psychology, Human Resources Management Review, etc. Not core science journals, exactly, but actual applied fields, with unskeptical mentions such as:

What about replications? I searched on *langer blank chanowitz 1978 replication* and found this paper by Folkes (1985), which reports:

Four studies examined whether verbal behavior is mindful (cognitive) or mindless (automatic). All studies used the experimental paradigm developed by E. J. Langer et al. In Studies 1–3, experimenters approached Ss at copying machines and asked to use it first. Their requests varied in the amount and kind of information given. Study 1 (82 Ss) found less compliance when experimenters gave a controllable reason (“… because I don’t want to wait”) than an uncontrollable reason (“… because I feel really sick”). In Studies 2 and 3 (42 and 96 Ss, respectively) requests for controllable reasons elicited less compliance than requests used in the Langer et al study. Neither study replicated the results of Langer et al. Furthermore, the controllable condition’s lower compliance supports a cognitive approach to social interaction. In Study 4, 69 undergraduates were given instructions intended to increase cognitive processing of the requests, and the pattern of compliance indicated in-depth processing of the request. Results provide evidence for cognitive processing rather than mindlessness in social interaction.

So this study concludes that the result didn’t replicate at all! On the other hand, it’s only a “partial replication,” and indeed they do not use the same conditions and wording as in the original 1978 paper. I don’t know why not, except maybe that exact replications traditionally get no respect.

Langer et al. responded in that journal, writing:

We see nothing in her results [Folkes (1985)] that would lead us to change our position: People are sometimes mindful and sometimes not.

Here they’re referring to the table from the 1978 study, reproduced at the top of this post, which shows a large effect of the “because I have to make copies” treatment under the “Small Favor” condition but no effect under the “Large Favor” condition. Again, given the huge standard errors here, we can’t take any of this seriously, but if you just look at the percentages without considering the uncertainty, then, sure, that’s what they found. Thus, in their response to the partial replication study that did not reproduce their results, Langer et al. emphasized that their original finding was not a main effect but an interaction: “People are sometimes mindful and sometimes not.”

That’s fine. Psychology studies often measure interactions, as they should: the world is a highly variable place.

But, in that case, everyone’s been misinterpreting that 1978 paper! When I say “everybody,” I mean this recent book by the business school professor and also the continuing references to the paper in the recent literature.

Here’s the deal. The message that everyone seems to have learned, or believed they learned, from the 1978 paper is that meaningless explanations are as good as meaningful explanations. But, according to the authors of that paper when they responded to criticism in 1985, the true message is that this trick works sometimes and sometimes not. That’s a much weaker message.

Indeed the study at hand is too small to draw any reliable conclusions about any possible interaction here. The most direct estimate of the interaction effect from the above table is (0.93 – 0.60) – (0.24 – 0.24) = 0.33, with a standard error of sqrt(0.93*0.07/15 + 0.60*0.40/15 + 0.24*0.76/25 + 0.24*0.76/25) = 0.19. So, no, I don’t see much support for the claim in this post from Psychology Today:

So what does this all mean? When the stakes are low people will engage in automatic behavior. If your request is small, follow your request with the word “because” and give a reason—any reason. If the stakes are high, then there could be more resistance, but still not too much.
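
Here’s the same sort of quick check in R for the interaction estimate and standard error computed a couple of paragraphs up (again just a sketch using the reported cell proportions and sample sizes):

(0.93 - 0.60) - (0.24 - 0.24)                # 0.33
sqrt(0.93 * 0.07 / 15 + 0.60 * 0.40 / 15 +
     0.24 * 0.76 / 25 + 0.24 * 0.76 / 25)    # about 0.19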

This happens a lot in unreplicable or unreplicated studies: a result is found under some narrow conditions, and then it is taken to have very general implications. This is just an unusual case where the authors themselves pointed out the issue. As they wrote in their 1985 article:

The larger concern is to understand how mindlessness works, determine its consequences, and specify better the conditions under which it is and is not likely to occur.

That’s a long way from the claim in that business book that “because” is a “magic word.”

Like a lot of magic, it only works under some conditions, and you can’t necessarily specify those conditions ahead of time. It works when it works.

There might be other replication studies of this copy machine study. I guess you couldn’t really do it now, because people don’t spend much time waiting at the copier. But the office copier was a thing for several decades. So maybe there are even some exact replications out there.

In searching for a replication, I did come across this post from 2009 by Mark Liberman that criticized yet another hyping of that 1978 study, this time from a paper by psychologist Daniel Kahneman in the American Economic Review. Kahneman wrote:

Ellen J. Langer et al. (1978) provided a well-known example of what she called “mindless behavior.” In her experiment, a confederate tried to cut in line at a copying machine, using various preset “excuses.” The conclusion was that statements that had the form of an unqualified request were rejected (e.g., “Excuse me, may I use the Xerox machine?”), but almost any statement that had the general form of an explanation was accepted, including “Excuse me, may I use the Xerox machine because I want to make copies?” The superficiality is striking.

As Liberman writes, this represented a “misunderstanding of the 1978 paper’s results, involving both a different conclusion and a strikingly overgeneralized picture of the observed effects.” Liberman performs an analysis of the data from that study which is similar to what I have done above.

Liberman summarizes:

The problem with Prof. Kahneman’s interpretation is not that he took the experiment at face value, ignoring possible flaws of design or interpretation. The problem is that he took a difference in the distribution of behaviors between one group of people and another, and turned it into generic statements about the behavior of people in specified circumstances, as if the behavior were uniform and invariant. The resulting generic statements make strikingly incorrect predictions even about the results of the experiment in question, much less about life in general.

Mindfulness

The key claim of all this research is that people are often mindless: they respond to the form of a request without paying attention to its context, with “because” acting as a “magic word.”

I would argue that this is exactly the sort of mindless behavior being exhibited by the people who are promoting that copying-machine experiment! They are taking various surface aspects of the study and using them to draw large, unsupported conclusions, without being mindful of the details.

In this case, the “magic words” are things like “p < .05,” “randomized experiment,” “Harvard,” “peer review,” and “Journal of Personality and Social Psychology” (this notwithstanding). The mindlessness comes from not looking into what exactly was in the paper being cited.

In conclusion . . .

So, yeah, thanks for nothing, Dmitri! Three hours of my life spent going down a rabbit hole. But, hey, if any readers who are single have read far enough down in the post to see my advice not to use “whom” in your dating profile, it will all have been worth it.

Seriously, though, the “mindlessness” aspect of this story is interesting. The point here is not, Hey, a 50-year-old paper has some flaws! Or the no-less-surprising observation: Hey, a pop business book exaggerates! The part that fascinates me is that there’s all this shaky research that’s being taken as strong evidence that consumers are mindless—and the people hyping these claims are themselves demonstrating the point by mindlessly following signals without looking into the evidence.

The ultimate advice that the mindfulness gurus are giving is not necessarily so bad. For example, here’s the conclusion of that online article about the business book:

Listen to the specific words other people use, and craft a response that speaks their language. Doing so can help drive an agreement, solution or connection.

“Everything in language we might use over email at the office … [can] provide insight into who they are and what they’re going to do in the future,” says Berger.

That sounds ok. Just forget all the blather about the “magic words” and the “superpowers,” and forget the unsupported and implausible claim that “Arguments, requests and presentations aren’t any more or less convincing when they’re based on solid ideas.” As often is the case, I think these Ted-talk style recommendations would be on more solid ground if they were just presented as the product of common sense and accumulated wisdom, rather than leaning on some 50-year-old psychology study that just can’t bear the weight. But maybe you can’t get the airport book and the Ted talk without a claim of scientific backing.

Don’t get me wrong here. I’m not attributing any malign motivations to any of the people involved in this story (except for Dmitri, I guess). I’m guessing they really believe all this. And I’m not using “mindless” as an insult. We’re all mindless sometimes—that’s the point of the Langer et al. (1978) study; it’s what Herbert Simon called “bounded rationality.” The trick is to recognize your areas of mindlessness. If you come to an area where you’re being mindless, don’t write a book about it! Even if you naively think you’ve discovered a new continent. As Mark Twain apparently never said, it ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.

The usual disclaimer

I’m not saying the claims made by Langer et al. (1978) are wrong. Maybe it’s true that, under conditions of mindlessness, all that matters is the “because” and any empty explanation will do; maybe the same results would show up in a preregistered replication. All I’m saying is that the noisy data that have been presented don’t provide any strong evidence in support of such claims, and that’s what bothers me about all those confident citations in the business literature.

P.S.

After writing the above post, I sent this response to Dmitri:

OK, I just spent 3 hours on this. I now have to figure out what to do with this after blogging it, because I think there are some important points here. Still, yeah, you did a bad thing by sending this to me. These are 3 hours I could’ve spent doing real work, or relaxing . . .

He replied:

I mean, yeah, that’s too bad for you, obviously. But … try to think about it from my point of view. I am more influential, I got you to work on this while I had a nice relaxing post-Valentine’s day sushi meal with my wife (much easier to get reservations on the 15th and the flowers are a lot cheaper), while you were toiling away on what is essentially my project. I’d say the magic words did their job.

Good point! He exploited my mindlessness. I responded:

Ok, I’ll quote you on that one too! (minus the V-day details).

I’m still chewing on your comment that you appreciate the Beatles for their innovation as much as for their songs. The idea that there are lots of songs of similar quality but not so much innovation, that’s interesting. The only thing is that I don’t know enough about music, even pop music, to have a mental map of where everything fits in. For example, I recently heard that Coldplay song, and it struck me that it was in the style of U2. But I don’t really know if U2 was the originator of that soaring sound. I guess Pink Floyd is kinda soaring too, but not quite in the same way . . . etc etc … the whole thing was frustrating to me because I had no sense of whether I was entirely bullshitting or not.

So if you can spend 3 hours writing a post on the above topic, we’ll be even.

Dmitri replied:

I am proud of the whole “Valentine’s day on the 15th” trick, so you are welcome to include it. That’s one of our great innovations. After the first 15-20 Valentine’s days, you can just move the date a day later and it is much easier.

And, regarding the music, he wrote:

U2 definitely invented a sound, with the help of their producer Brian Eno.

It is a pretty safe bet that every truly successful musician is an innovator—once you know the sound it is easy enough to emulate. Beethoven, Charlie Parker, the Beatles, all the really important guys invented a forceful, effective new way of thinking about music.

U2 is great, but when I listened to an entire U2 song from beginning to end, it seemed so repetitive as to be unlistenable. I don’t feel that way about the Beatles or REM. But just about any music sounds better to me in the background, which I think is a sign of my musical ignorance and tone-deafness (for real, I’m bad at recognizing pitches) more than anything else. I guess the point is that you’re supposed to dance to it, not just sit there and listen.

Anyway, I warned Dmitri about what would happen if I post his Valentine’s Day trick:

I post this, then it will catch on, and it will no longer work . . . just warning ya! You’ll have to start doing Valentine’s Day on the 16th, then the 17th, . . .

To which Dmitri responded:

Yeah but if we stick with it, it will roll around and we will get back to February 14 while everyone else is celebrating Valentine’s Day on these weird wrong days!

I’ll leave him with the last word.

How large is that treatment effect, really? (My talk at the NYU economics seminar, Thurs 18 Apr, NOT Thurs 7 Mar)

Thurs 18 Apr 2024 (not Thurs 7 Mar), 12:30pm at 19 West 4th St., room 517:

How large is that treatment effect, really?

“Unbiased estimates” aren’t really unbiased, for a bunch of reasons, including aggregation, selection, extrapolation, and variation over time. Econometrics typically focuses on causal identification, with the goal of estimating “the” effect. But we typically care about individual effects (not “Does the treatment work?” but “Where and when does it work?” and “Where and when does it hurt?”). Estimating individual effects is relevant not only for individuals but also for generalizing to the population. For example, how do you generalize from an A/B test performed on a sample right now to possible effects on a different population in the future? Thinking about variation and generalization can change how we design and analyze experiments and observational studies. We demonstrate with examples in social science and public health.

Our new book, Active Statistics, is now available!

Coauthored with Aki Vehtari, this new book is lots of fun, perhaps the funnest I’ve ever been involved in writing. And it’s stuffed full of statistical insights. The webpage for the book is here, and the link to buy it is here or directly from the publisher here.

With hundreds of stories, activities, and discussion problems on applied statistics and causal inference, this book is a perfect teaching aid, a perfect adjunct to a self-study program, and an enjoyable bedside read if you already know some statistics.

Here’s the quick summary:

This book provides statistics instructors and students with complete classroom material for a one- or two-semester course on applied regression and causal inference. It is built around 52 stories, 52 class-participation activities, 52 hands-on computer demonstrations, and 52 discussion problems that allow instructors and students to explore the real-world complexity of the subject. The book fosters an engaging “flipped classroom” environment with a focus on visualization and understanding. The book provides instructors with frameworks for self-study or for structuring the course, along with tips for maintaining student engagement at all levels, and practice exam questions to help guide learning. Designed to accompany the authors’ previous textbook Regression and Other Stories, its modular nature and wealth of material allow this book to be adapted to different courses and texts or be used by learners as a hands-on workbook.

It’s got 52 of everything because it’s structured around a two-semester class, with 13 weeks per semester and 2 classes per week. It’s really just bursting with material, including some classic stories and lots of completely new material. Right off the bat we present a statistical mystery that arose with a Wikipedia experiment, and we have a retelling of the famous Literary Digest survey story but with a new and unexpected twist (courtesy of Sharon Lohr and J. Michael Brick). And the activities have so much going on! One of my favorites is a version of the two truths and a lie game that demonstrates several different statistical ideas.

People have asked how this differs from my book with Deborah Nolan, Teaching Statistics: A Bag of Tricks. My quick answer is that Active Statistics has a lot more of everything, it’s structured to cover an entire two-semester course in order, and it’s focused on applied statistics. Including a bunch of stories, activities, demonstrations, and problems on causal inference, a topic that is not always so well integrated into the statistics curriculum. You’re gonna love this book.

You can buy it here or here. It’s only 25 bucks, which is an excellent deal considering how stuffed it is with useful content. Enjoy.

Hey! Here’s some R code to make colored maps using circle sizes proportional to county population.

Kieran Healy shares some code and examples of colored maps where each region is given a circle in proportion to its population. He calls these “Dorling cartograms,” which sounds kinda mysterious to me but I get that there’s no easy phrase to describe them. It’s clear in the pictures, though:

I wrote to Kieran asking if it was possible to make the graphs without solid circles around each point, as that could make them more readable.

He replied:

Yeah it’s easy to do that, you just give different parameters to geom_sf(), specifically you set the linewidth to 0 so no border is drawn on the circles. So instead of geom_sf(color = "gray30") or whatever you say geom_sf(linewidth = 0). But I think this does not in fact make things more readable with a white, off-white, or light gray background:

The circle borders do a fair amount of work to help the eye see where the circles actually are as distinct elements. It’s possible to make the border more subtle and still have it work:

In this version the circle borders are only a *very slightly* darker gray than the background, but it makes a big difference still.

Finally you could also remove the circle borders but make the background very dark, like this:

Not bad, though the issue then becomes properly seeing the dark orange, especially in the smaller counties with a very high percent Black. This would work better with one of the other palettes.

Interesting. Another win for ggplot.
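
If you want to play with this yourself, here’s a minimal sketch of the general approach in R (not Kieran’s actual code), assuming the sf, cartogram, and ggplot2 packages and a hypothetical county-level sf data frame called counties with columns pop and pct_black; the color scale is an arbitrary choice:

library(sf)
library(cartogram)   # provides cartogram_dorling()
library(ggplot2)

# counties: a hypothetical sf data frame of county polygons with columns
# pop (population) and pct_black (percent Black residents).
counties <- st_transform(counties, 5070)   # the cartogram step needs a projected CRS (here, US Albers)

# Replace each county polygon with a circle whose area is proportional to its population
dorling <- cartogram_dorling(counties, weight = "pop")

# Subtle borders: a gray only slightly darker than the panel background
ggplot(dorling) +
  geom_sf(aes(fill = pct_black), color = "gray85", linewidth = 0.1) +
  scale_fill_viridis_c() +
  theme_void() +
  theme(panel.background = element_rect(fill = "gray92", color = NA))

# No borders at all (linewidth = 0), paired with a dark background
ggplot(dorling) +
  geom_sf(aes(fill = pct_black), linewidth = 0) +
  scale_fill_viridis_c() +
  theme_void() +
  theme(panel.background = element_rect(fill = "gray15", color = NA))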

How to code and impute income in studies of opinion polls?

Nate Cohn asks:

What’s your preferred way to handle income in a regression when income categories are inconsistent across several combined survey datasets? Am I best off just handling this with multiple categorical variables? Can I safely create a continuous variable?

My reply:

I thought a lot about this issue when writing Red State Blue State. My preferred strategy is to use a variable that we can treat as continuous. For example, when working with ANES data I used income categories 1, 2, 3, 4, 5, which corresponded to the 1-16th, 16-33rd, 34-66th, 67-95th, and 96-100th percentiles of the income distribution. If you have different surveys with different categories, you can put them on a roughly consistent scale; for example, one survey might be coded as 1, 3, 5, 7 and another as 2, 4, 6, 8. I expect that other people would disagree with this advice, but this is the sort of thing I was doing. I’m not so much worried about the scale being imperfect or nonlinear. But if you have a non-monotonic relation, you’ll have to be more careful.
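
To make the recoding concrete, here’s a minimal sketch in R; the two surveys, their category schemes, and the particular 1-8 scale are made up for illustration:

# Two hypothetical surveys whose income questions use different categories.
# Map each survey's categories onto a single rough scale (here 1-8), chosen so
# that the categories line up approximately by income percentile.
recode_a <- c("1" = 1, "2" = 3, "3" = 5, "4" = 7)   # survey A: four broad categories
recode_b <- c("1" = 2, "2" = 4, "3" = 6, "4" = 8)   # survey B: four different cutpoints

survey_a <- data.frame(income_cat = sample(1:4, 500, replace = TRUE), source = "A")
survey_b <- data.frame(income_cat = sample(1:4, 800, replace = TRUE), source = "B")

survey_a$income_score <- recode_a[as.character(survey_a$income_cat)]
survey_b$income_score <- recode_b[as.character(survey_b$income_cat)]

combined <- rbind(survey_a, survey_b)
# combined$income_score can now enter a regression as a single, roughly continuous
# predictor, keeping in mind that the scale is imperfect and possibly nonlinear.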

Cohn responds:

Two other thoughts for consideration:

— I am concerned about non-monotonicity. At least in this compilation of 2020 data, the Democrats do best among the rich and the poor and sag in the middle. It seems even more extreme when we get into the highest/lowest income strata, a la ANES. I’m not sure this survives controls (it seems like there’s basically no income effect after controls), but I’m hesitant to squelch a possible non-monotonic effect that I haven’t ruled out.

—I’m also curious for your thoughts on a related case. Suppose that (a) the dataset includes surveys that sometimes asked about income and sometimes did not, (b) we’re interested in many demographic covariates besides income, and (c) we’d otherwise clearly specify the interaction between income and the other variables. The missing income data creates several challenges. What should we do?

I can imagine some hacky solutions to the NA data problem short of outright removing observations (say, set all NA income to 1 and interact our continuous income variable with an indicator for whether we have actual income data), but if we interact other variables with the NA income data there are lots of cases (say, MRP, where the population strata specify income for the full population, not in proportion to survey coverage) where we’d risk losing much of the power gleaned from the other surveys about the other demographic covariates. What should we do here?

My quick recommendation is to fit a model with two stages: first predict income given your other covariates, then predict your outcome of interest (issue attitude, vote preference, whatever) given income and the other covariates. You can fit the two models simultaneously in one Stan program, as sketched below. I guess you’ll then want some continuous coding for income (could be something like sqrt(income), with income topcoded at $300K), along with a possibly non-monotonic model at the second level.
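
Here’s a rough sketch of what that could look like as a single Stan program, written as a string to be compiled from R (I’m assuming the cmdstanr interface); the normal model for income, the logistic outcome, the priors, and all the variable names are placeholder assumptions rather than the actual analysis:

library(cmdstanr)   # assumed interface; rstan would work similarly

stan_code <- "
data {
  int<lower=0> N_obs;                        // respondents who reported income
  int<lower=0> N_mis;                        // respondents with missing income
  int<lower=0> K;                            // number of other covariates
  matrix[N_obs, K] X_obs;
  matrix[N_mis, K] X_mis;
  vector[N_obs] income_obs;                  // continuous coding, e.g. sqrt(income), topcoded
  array[N_obs] int<lower=0, upper=1> y_obs;  // outcome, e.g. vote preference
  array[N_mis] int<lower=0, upper=1> y_mis;
}
parameters {
  vector[K] gamma;                           // stage 1: income given other covariates
  real<lower=0> sigma;
  vector[N_mis] income_mis;                  // missing incomes treated as parameters
  real alpha;                                // stage 2: outcome given income and covariates
  vector[K] beta;
  real b_income;
}
model {
  // stage 1: predict income from the other covariates
  income_obs ~ normal(X_obs * gamma, sigma);
  income_mis ~ normal(X_mis * gamma, sigma);
  // stage 2: predict the outcome from income and the other covariates
  y_obs ~ bernoulli_logit(alpha + X_obs * beta + b_income * income_obs);
  y_mis ~ bernoulli_logit(alpha + X_mis * beta + b_income * income_mis);
  // weakly informative priors
  gamma ~ normal(0, 2.5);
  beta ~ normal(0, 2.5);
  alpha ~ normal(0, 2.5);
  b_income ~ normal(0, 2.5);
  sigma ~ normal(0, 2.5);
}
"

# mod <- cmdstan_model(write_stan_file(stan_code))
# fit <- mod$sample(data = stan_data)   # stan_data: a list matching the data block above

A non-monotonic income effect at the second stage can then be accommodated by, say, adding a quadratic term or a spline in income.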

Minor-league Stats Predict Major-league Performance, Sarah Palin, and Some Differences Between Baseball and Politics

In politics, as in baseball, hot prospects from the minors can have trouble handling big-league pitching.

Right after Sarah Palin was chosen as the Republican nominee for vice president in 2008, my friend Ubs, who grew up in Alaska and follows politics closely, wrote the following:

Palin would probably be a pretty good president. . . . She is fantastically popular. Her percentage approval ratings have reached the 90s. Even now, with a minor nepotism scandal going on, she’s still about 80%. . . . How does one do that? You might get 60% or 70% who are rabidly enthusiastic in their love and support, but you’re also going to get a solid core of opposition who hate you with nearly as much passion. The way you get to 90% is by being boringly competent while remaining inoffensive to people all across the political spectrum.

Ubs gives a long discussion of Alaska’s unique politics and then writes:

Palin’s magic formula for success has been simply to ignore partisan crap and get down to the boring business of fixing up a broken government. . . . It’s not a very exciting answer, but it is, I think, why she gets high approval ratings — because all the Democrats, Libertarians, and centrists appreciate that she’s doing a good job on the boring non-partisan stuff that everyone agrees on and she isn’t pissing them off by doing anything on the partisan stuff where they disagree.

Hey–I bet you never thought you’d see the words “boringly competent,” “inoffensive,” and “Sarah Palin” in the same sentence!

Prediction and extrapolation

OK, so what’s the big deal? Palin got a reputation as a competent nonpartisan governor but when she hit the big stage she shifted to hyper-partisanship. The contrast is interesting to me because it suggests a failure of extrapolation.

Now let’s move to baseball. One of the big findings of baseball statistics guru Bill James is that minor-league statistics, when correctly adjusted, predict major-league performance. James is working through a three-step process: (1) naive trust in minor league stats, (2) a recognition that raw minor league stats are misleading, (3) a statistical adjustment process, by which you realize that there really is a lot of information there, if you know how to use it.

For a political analogy, consider Scott Brown. When he was running for the Senate last year, political scientist Boris Shor analyzed his political ideology. The question was, how would he vote in the Senate if he were elected? Boris wrote:

We have evidence from multiple sources. The Boston Globe, in its editorial endorsing Coakley, called Brown “in the mode of the national GOP.” Liberal bloggers have tried to tie him to the Tea Party movement, making him out to be very conservative. Chuck Schumer called him “far-right.”

In 2002, he filled out a Votesmart survey on his policy positions in the context of running for the State Senate. Looking through the answers doesn’t reveal too much beyond that he is a pro-choice, anti-tax, pro-gun Republican. His interest group ratings are all over the map. . . .

All in all, a very confusing assessment, and quite imprecise. So how do we compare Brown to other state legislators, or more generally to other politicians across the country?

My [Boris’s] research, along with Princeton’s Nolan McCarty, allows us to make precisely these comparisons. Essentially, I use the entirety of state legislative voting records across the country, and I make them comparable by calibrating them through Project Votesmart’s candidate surveys.

By doing so, I can estimate Brown’s ideological score very precisely. It turns out that his score is -0.17, compared with her [Coakley’s] score of 0.02. Liberals have lower scores; conservatives higher ones. Brown’s score puts him at the 34th percentile of his party in Massachusetts over the 1995-2006 time period. In other words, two thirds of other Massachusetts Republican state legislators were more conservative than he was. This is evidence for my [Boris’s] claim that he’s a liberal even in his own party. What’s remarkable about this is the fact that Massachusetts Republicans are the most, or nearly the most, liberal Republicans in the entire country!

Very Jamesian, wouldn’t you say? And Boris’s analysis was borne out by Scott Brown’s voting record: he was indeed the most liberal of the Senate’s Republicans.

Political extrapolation

OK, now back to Sarah Palin. First, her popularity. Yes, Gov. Palin was popular, but Alaska is a small (in population) state, and surveys find that most of the popular governors in the U.S. are in small states. Here are data from 2006 and 2008:

[Graph of governor approval ratings against state population, 2006 and 2008.]

There are a number of theories about this pattern; what’s relevant here is that a Bill James-style statistical adjustment might be necessary before taking state-level stats to the national level.

The difference between baseball and politics

There’s something else going on, though. It’s not just that Palin isn’t quite so popular as she appeared at first. There’s also a qualitative shift. From “boringly competent nonpartisan” to . . . well, leaving aside any questions of competence, she’s certainly no longer boring or nonpartisan! In baseball terms, this is like Ozzie Smith coming up from the minors and becoming a Dave Kingman-style slugger. (Please excuse my examples which reveal how long it’s been since I’ve followed baseball!)

So how does baseball differ from politics, in ways that are relevant to statistical forecasting?

1. In baseball there is only one goal: winning. Scoring more runs than the other team. Yes, individual players have other goals: staying healthy, getting paid, not getting traded to Montreal, etc., but overall the different goals are aligned, and playing well will get you all of these to some extent.

But there are two central goals in politics: winning and policy. You want to win elections, but the point of winning is to enact policies that you like. (Sure, there are political hacks who will sell out to the highest bidder, but even these political figures represent some interest groups with goals beyond simply being in office.)

Thus, in baseball we want to predict how a player can help his team win, but in politics we want to predict two things: electoral success and also policy positions.

2. Baseball is all about ability: natural athletic ability, intelligence (as Bill James said, that and speed are the only skills that are used in both offense and defense), and plain old hard work, focus, and concentration. The role of ability in politics is not so clear. In his remarks that started this discussion, Ubs suggested that Palin had the ability and inclination to solve real problems. But it’s not clear how to measure such abilities in a way that would allow any generalization to other political settings.

3. Baseball is the same environment at all levels. The base paths are the same length in the major leagues as in AA ball (at least, I assume that’s true!); the only difference is that in the majors they throw harder. OK, maybe the strike zone and the field dimensions vary, but pretty much it’s the same game.

In politics, though, I dunno. Some aspects of politics really do generalize. The Massachusetts Senate has got to be a lot different from the U.S. Senate, but, in their research, Boris Shor and Nolan McCarty have shown that there’s a lot of consistency in how people vote in these different settings. But I suspect things are a lot different for the executive, where your main task is not just to register positions on issues but to negotiate.

4. In baseball, you’re in or out. If you’re not playing (or coaching), you’re not really part of the story. Sportswriters can yell all they want but who cares. In contrast, politics is full of activists, candidates, and potential candidates. In this sense, the appropriate analogy is not that Sarah Palin started as Ozzie Smith and then became Dave Kingman, but rather a move from being Ozzie Smith to being a radio call-in host, in a world in which media personalities can be as powerful, and as well-paid, as players on the field. Perhaps this could’ve been a good move for, say, Bill Lee, in this alternative universe? A player who can’t quite keep the ball over the plate but is a good talker with a knack for controversy?

Commenter Paul made a good point here:

How many at-bats long is a governorship? The most granular unit I could imagine talking about is a quarter, and we’d do a better job of making each “at-bat” independent of the previous one if we went all the way to the term level. Either way, 20 or so at-bats don’t have much predictive value. Even over a full 500-at-bat season, fans try to figure out whether a big jump in BABIP is a sign of better bat control or just luck.

The same issues arise at very low at-bat counts too. If you bat in front of a slugger, you can sit on pitches in the zone. If you’ve got a weakness against a certain pitching style, you might not happen to see it. And once the ball is in the air, luck is a huge factor in whether it travels to a fielder or between them.

I suspect that if we could somehow get a political candidate to hold 300-400 different political jobs in different states, with different party goals and support, we’d be able to do a good job predicting future job performance, even jumping from the state to the national level. But the day-to-day successes of a governor are highly correlated.

Indeed, when it comes to policy positions, a politician has lots of “plate appearances,” that is, opportunities to vote in the legislature. But when it comes to elections, a politician will only have at most a couple dozen in his or her entire career.

All the above is from a post from 2011. I thought about it after this recent exchange with Mark Palko regarding the political candidacy of Ron DeSantis.

In addition to everything above, let me add one more difference between baseball and politics. In baseball, the situation is essentially fixed, and pretty much all that matters is player ability. In contrast, in politics, the most important factor is the situation. In general elections in the U.S., the candidate doesn’t matter that much. (Primaries are a different story.) In summary, to distinguish baseball players in ability we have lots of data to estimate a big signal; to distinguish politicians in vote-getting ability we have very little data to estimate a small signal.

The four principles of Barnard College: Respect, empathy, kindness . . . and censorship?

A few months ago we had Uh oh Barnard . . .

And now there’s more:

Barnard is mandating that students remove any items affixed to room or suite doors by Feb. 28, after which point the college will begin removing any remaining items, Barnard College Dean Leslie Grinage announced in a Friday email to the Barnard community. . . .

“We know that you have been hearing often lately about our community rules and policies. And we know it may feel like a lot,” Grinage wrote. “The goal is to be as clear as possible about the guardrails, and, meeting the current moment, do what we can to support and foster the respect, empathy and kindness that must guide all of our behavior on campus.”

According to the student newspaper, here’s the full email from the Barnard dean:

Dear Residential Students,

The residential experience is an integral part of the Barnard education. Our small campus is a home away from home for most of you, and we rely on each other to help foster an environment where everyone feels welcome and safe. This is especially important in our residential spaces. We encourage debate and discussion and the free exchange of ideas, while upholding our commitment to treating one another with respect, consideration and kindness. In that spirit, I’m writing to remind you of the guardrails that guide our residential community — our Residential Life and Housing Student Guide.

While many decorations and fixtures on doors serve as a means of helpful communication amongst peers, we are also aware that some may have the unintended effect of isolating those who have different views and beliefs. So, we are asking everyone to remove any items affixed to your room and/or suite doors (e.g. dry-erase boards, decorations, messaging) by Wednesday, February 28 at noon; the College will remove any remaining items starting Thursday, February 29. The only permissible items on doors are official items placed by the College (e.g. resident name tags). (Those requesting an exemption for religious or other reasons should contact Residential Life and Housing by emailing [email protected].)

We know that you have been hearing often lately about our community rules and policies. And we know it may feel like a lot. The goal is to be as clear as possible about the guardrails, and, meeting the current moment, do what we can to support and foster the respect, empathy and kindness that must guide all of our behavior on campus.

The Residential Life and Housing team is always here to support you, and you should feel free to reach out to them with any questions you may have.

Please take care of yourselves and of each other. Together we can build an even stronger Barnard community.

Sincerely,

Leslie Grinage

Vice President for Campus Life and Student Experience and Dean of the College

The dean’s letter links to this Residential Life and Housing Student Guide, which I took a look at. It’s pretty reasonable, actually. All I saw regarding doors was this mild restriction:

While students are encouraged to personalize their living space, they may not alter the physical space of the room, drill or nail holes into any surface, or affix tapestries and similar decorations to the ceiling, light fixtures, or doorways. Painting any part of the living space or college-supplied furniture is also prohibited.

The only thing in the entire document that seemed objectionable was the no-sleeping-in-the-lounges policy, but I can’t imagine they would enforce that rule unless someone was really abusing the privilege. They’re not gonna send the campus police to wake up a napper.

So, yeah, they had a perfectly reasonable rulebook and then decided to mess it all up by not letting the students decorate their doors. So much for New York, center of free expression.

I assume what’s going on here is that Barnard wants to avoid the bad publicity that comes from clashes between groups of students with opposing political views. And now they’re getting bad publicity because they’re censoring students’ political expression.

The endgame seems to be to turn the college into some sort of centrally controlled corporate office park. But that comparison wouldn’t be fair. In a corporate office, they let you decorate your own cubicle, right?

ISBA 2024 Satellite Meeting: Lugano, 25–28 June

Antonietta Mira is organizing a satellite workshop before ISBA. It’s free, there is still time to submit a poster, and it’s a great excuse to visit Lugano. Here are the details:

I really like small meetings like this. Mitzi and I are going to be there and then continue on to ISBA.