Conformal prediction and people

This is Jessica. A couple of weeks ago I wrote a post in response to Ben Recht’s critique of conformal prediction for quantifying uncertainty in a prediction. Compared to Ben, I am more open-minded about conformal prediction and associated generalizations like conformal risk control. Quantified uncertainty is inherently incomplete as an expression of the true limits of our knowledge, but I still often find value in trying to quantify it over stopping at a point estimate.

If expressions of uncertainty are generally wrong in some ways but still sometimes useful, then we should be interested in how people interact with different approaches to quantifying uncertainty. So I’m interested in seeing how people use conformal prediction sets relative to alternatives. This isn’t to say that I think conformal approaches can’t be useful without being human-facing (which is the direction of some recent work on conformal decision theory). I just don’t think I would have spent the last ten years thinking about how people interact and make decisions with data and models if I didn’t believe that they need to be involved in many decision processes. 

So now I want to discuss what we know from the handful of controlled studies that have looked at human use of prediction sets, starting with the one I’m most familiar with since it’s from my lab.

In Evaluating the Utility of Conformal Prediction Sets for AI-Advised Image Labeling, we study people making decisions with the assistance of a predictive model. Specifically, they label images with access to predictions from a pre-trained computer vision model. In keeping with the theme that real world conditions may deviate from expectations, we consider two scenarios: one where the model makes highly accurate predictions because the new images are from the same distribution as those that the model is trained on, and one where the new images are out of distribution. 

We compared their accuracy and the distance between their responses and the true label (in the Wordnet hierarchy, which conveniently maps to ImageNet) across four display conditions. One was no assistance at all, so we could benchmark unaided human accuracy against model accuracy for our setting. People were generally worse than the model in this setting, though the human with AI assistance was able to do better than the model alone in a few cases.

The other three displays were variations on model assistance, including the model’s top prediction with the softmax probability, the top 10 model predictions with softmax probabilities, and a prediction set generated using split conformal prediction with 95% coverage.

We calibrated the prediction sets we presented offline, not dynamically. Because the human is making decisions conditional on the model predictions, we should expect the distribution to change. But often we aren’t going to be able to calibrate adaptively because we don’t immediately observe the ground truth. And even if we do, at any particular point in time we could still be said to hover on the boundary of having useful prior information and steering things off course. So when we introduce a new uncertainty quantification to any human decision setting, we should be concerned with how it works both when the setting is as expected and when it’s not, i.e., when the guarantees may be misleading.

Our study partially gets at this. Ideally we would have tested some cases where the stated coverage guarantee for the prediction sets was false. But for the out-of-distribution images we generated, we would have had to do a lot of cherry-picking of stimuli to break the conformal coverage guarantee as much as the top-1 coverage broke. The coverage degraded a little but stayed pretty high over the entire set of out-of-distribution instances for the types of perturbations we focused on (>80%, compared to 70% for top 10 and 43% for top 1). For the set of stimuli we actually tested, the coverage for all three was a bit higher, with top-1 coverage getting the biggest bump (70%, compared to 83% for top 10 and 95% for conformal). Below are some examples of the images people were classifying (where easy and hard are based on the cross-entropy loss given the model’s predicted probabilities, and smaller and larger refer to the size of the prediction sets).

We find that prediction sets don’t offer much value over top-1 or top-10 displays when the test instances are iid, and they can reduce accuracy on average for some types of instances. However, when the test instances are out of distribution, accuracy is slightly higher with access to prediction sets than with either top-k display. This was the case even though the prediction sets for the OOD instances get very large (the average set size for “easy” OOD instances, as defined by the distribution of softmax values, was ~17; for “hard” OOD instances it was ~61, with people sometimes seeing sets with over 100 items). For the in-distribution cases, average set size was about 11 for the easy instances and 30 for the hard ones.

Based on the differences in coverage across the conditions we studied, our results are more likely to be informative for situations where conformal prediction is used because we think it’s going to degrade more gracefully under unexpected shifts. I’m not sure it’s reasonable to assume we’d have a good hunch about that in practice though.

In designing this experiment with my co-authors, and in thinking more about the value of conformal prediction for model-assisted human decisions, I’ve been wondering when a “bad” (in the sense of coming with a misleading guarantee) interval might still be better than no uncertainty quantification. I was recently reading Paul Meehl’s Clinical versus Statistical Prediction, where he contrasts the clinical judgments doctors make based on intuitive reasoning with statistical judgments informed by randomized controlled experiments. He references a distinction between the “context of discovery,” in which some internal sense of probability leads to a decision like a diagnosis, and the “context of verification,” in which we collect the data we need to verify the quality of a prediction.

The clinician may be led, as in the present instance, to a guess which turns out to be correct because his brain is capable of that special “noticing the unusual” and “isolating the pattern” which is at present not characteristic of the traditional statistical techniques. Once he has been so led to a formulable sort of guess, we can check up on him actuarially. 

Thinking about the ways prediction intervals can affect decisions makes me think that whenever we’re dealing with humans, there’s potentially going to be a difference between what an uncertainty expression says and can guarantee and the value of that expression for the decision-maker. Quantifications with bad guarantees can still be useful if they change the context of discovery in ways that promote broader thinking or taking the idea of uncertainty seriously. This is what I meant when in my last post I said “the meaning of an uncertainty quantification depends on its use.” But precisely articulating how they do this is hard. It’s much easier to identify ways calibration can break.

There are a few other studies that look at human use of conformal prediction sets, but to avoid making this post even longer, I’ll summarize them in an upcoming post.

P.S. There have been a few other interesting posts on uncertainty quantification in the CS blogosphere recently, including David Stutz’s response to Ben’s remarks about conformal prediction, and on designing uncertainty quantification for decision making from Aaron Roth.

Their signal-to-noise ratio was low, so they decided to do a specification search, use a one-tailed test, and go with a p-value of 0.1.

Adam Zelizer writes:

I saw your post about the underpowered COVID survey experiment on the blog and wondered if you’ve seen this paper, “Counter-stereotypical Messaging and Partisan Cues: Moving the Needle on Vaccines in a Polarized U.S.” It is written by a strong team of economists and political scientists and finds large positive effects of Trump pro-vaccine messaging on vaccine uptake.

They find large positive effects of the messaging (administered through Youtube ads) on the number of vaccines administered at the county level—over 100 new vaccinations in treated counties—but only after changing their specification from the prespecified one in the PAP. The p-value from the main modified specification is only 0.097, from a one-tailed test, and the effect size from the modified specification is 10 times larger than what they get from the pre-specified model. The prespecified model finds that showing the Trump advertisement increased the number of vaccines administered in the average treated county by 10; the specification in the paper, and reported in the abstract, estimates 103 more vaccines. So moving from the specification in the PAP to the one in the paper doesn’t just improve precision, but it dramatically increases the estimated treatment effect. A good example of suppression effects.

They explain their logic for using the modified specification, but it smells like the garden of forking paths.

Here’s a snippet from the article:

I don’t have much to say about the forking paths except to give my usual advice to fit all reasonable specifications and use a hierarchical model, or at the very least do a multiverse analysis. No reason to think that the effect of this treatment should be zero, and if you really care about effect size you want to avoid obvious sources of bias such as model selection.

The above bit about one-tailed tests reflects a common misunderstanding in social science. As I’ll keep saying until my lips bleed, effects are never zero. They’re large in some settings, small in others, sometimes positive, sometimes negative. From the perspective of the researchers, the idea of the hypothesis test is to give convincing evidence that the treatment truly has a positive average effect. That’s fine, and it’s addressed directly through estimation: the uncertainty interval gives you a sense of what the data can tell you here.

When they say they’re doing a one-tailed test and they’re cool with a p-value of 0.1 (that would be 0.2 when following the standard approach) because they have “low signal-to-noise ratios” . . . that’s just wack. Low signal-to-noise ratio implies high uncertainty in your conclusions. High uncertainty is fine! You can still recommend this policy be done in the midst of this uncertainty. After all, policymakers have to do something. To me, this one-sided testing and p-value thresholding thing just seems to be missing the point, in that it’s trying to squeeze out an expression of near-certainty from data that don’t admit such an interpretation.

P.S. I do not write this sort of post out of any sort of animosity toward the authors or toward their topic of research. I write about these methods issues because I care. Policy is important. I don’t think it is good for policy for researchers to use statistical methods that lead to overconfidence and inappropriate impressions of certainty or near-certainty. The goal of a statistical analysis should not be to attain statistical significance or to otherwise reach some sort of success point. It should be to learn what we can from our data and model, and to also get a sense of what we don’t know.

Putting a price on vaccine hesitancy (Bayesian analysis of a conjoint experiment)

Tom Vladeck writes:

I thought you may be interested in some internal research my company did using a conjoint experiment, with analysis using Stan! The upshot is that we found that vaccine hesitant people would require a large payment to take the vaccine, and that there was a substantial difference between the prices required for J&J and Moderna & Pfizer (evidence that the pause was very damaging). You can see the model code here.

My reply: Cool! I recommend you remove the blank lines from your Stan code as that will make your program easier to read.

Vladeck responded:

I prefer a lot of vertical white space. But good to know that I’m likely in the minority there.

For me, it’s all about the real estate. White space can help code be more readable but it should be used sparingly. What I’d really like is a code editor that does half white spaces.

Defining optimal reliance on model predictions in AI-assisted decisions

This is Jessica. In a previous post I mentioned methodological problems with studies of AI-assisted decision-making, such as those used to evaluate different model explanation strategies. The typical study set-up gives people some decision task (e.g., given the features of this defendant, decide whether to convict or release), has them make their decision, then gives them access to a model’s prediction, and observes whether they change their mind. Studying this kind of AI-assisted decision task is of interest as organizations deploy predictive models to assist human decision-making in domains like medicine and criminal justice. Ideally, the human is able to use the model to improve on the performance they would get on their own or that the model would get if deployed without a human in the loop (referred to as complementarity).

The most frequently used definition of appropriate reliance is that if the person goes with the model prediction but it’s wrong, this is overreliance. If they don’t go with the model prediction but it’s right, this is labeled underreliance. Otherwise it is labeled appropriate reliance. 
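
In code, that labeling amounts to something like the following (a minimal sketch; the variable and function names are mine, not from any particular paper):

```python
# Sketch of the commonly used post-hoc labeling of reliance.
# `followed_ai` and `ai_correct` are per-trial booleans.
def label_reliance(followed_ai: bool, ai_correct: bool) -> str:
    if followed_ai and not ai_correct:
        return "overreliance"
    if not followed_ai and ai_correct:
        return "underreliance"
    return "appropriate reliance"
```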

This definition is problematic for several reasons. One is that the AI might have a higher probability than the human of selecting the right action, but still end up being wrong. It doesn’t make sense to say the human made the wrong choice by following it in such cases. Because it’s based on post-hoc correctness, this approach confounds two sources of non-optimal human behavior: not accurately estimating the probability that the AI is correct versus not making the right choice of whether to go with the AI or not given one’s beliefs. 

By scoring decisions in action space, it also equally penalizes not choosing the right action (which prediction to go with) in a scenario where the human and the AI have very similar probabilities of being correct and one where either the AI or human has a much higher probability of being correct. Nevertheless, there are many papers doing it this way, some with hundreds of citations. 

In A Statistical Framework for Measuring AI Reliance, Ziyang Guo, Yifan Wu, Jason Hartline and I write: 

Humans frequently make decisions with the aid of artificially intelligent (AI) systems. A common pattern is for the AI to recommend an action to the human who retains control over the final decision. Researchers have identified ensuring that a human has appropriate reliance on an AI as a critical component of achieving complementary performance. We argue that the current definition of appropriate reliance used in such research lacks formal statistical grounding and can lead to contradictions. We propose a formal definition of reliance, based on statistical decision theory, which separates the concepts of reliance as the probability the decision-maker follows the AI’s prediction from challenges a human may face in differentiating the signals and forming accurate beliefs about the situation. Our definition gives rise to a framework that can be used to guide the design and interpretation of studies on human-AI complementarity and reliance. Using recent AI-advised decision making studies from literature, we demonstrate how our framework can be used to separate the loss due to mis-reliance from the loss due to not accurately differentiating the signals. We evaluate these losses by comparing to a baseline and a benchmark for complementary performance defined by the expected payoff achieved by a rational agent facing the same decision task as the behavioral agents.

It’s a similar approach to our rational agent framework for data visualization, but here we assume a setup in which the decision-maker receives a signal consisting of the feature values for some instance, the AI’s prediction, the human’s prediction, and optionally some explanation of the AI decision. The decision-maker chooses which prediction to go with. 

We can compute the upper bound or best attainable performance in such a study (rational benchmark) as the expected score of a rational decision-maker on a randomly drawn decision task. The rational decision-maker has prior knowledge of the data generating model (the joint distribution over the signal and ground truth state). Seeing the instance in a decision trial, they accurately perceive the signal, arrive at Bayesian posterior beliefs about the distribution of the payoff-relevant state, then choose the action that maximizes their expected utility over the posterior. We calculate this in payoff space as defined by the scoring rule, such that the cost of an error can vary in magnitude. 

We can define the “value of rational complementation” for the decision-problem at hand by also defining the rational agent baseline: the expected performance of the rational decision-maker without access to the signal on a randomly chosen decision task from the experiment. Because it represents the score the rational agent would get if they could rely only on their prior beliefs about the data-generating model, the baseline is the expected score of a fixed strategy that always chooses the better of the human alone or the AI alone.
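
As a rough sketch of how these two reference points could be computed from empirical study data (this is my own simplification, not the paper’s code; the 0/1 scoring rule, the column names, and the discrete signal identifier are all assumptions):

```python
# Rough sketch of the rational agent baseline and benchmark on empirical study
# data. Assumes a dataframe with one row per trial, a discrete `signal_id`,
# and 0/1 indicators for whether the human's and the AI's predictions were correct.
import pandas as pd

def rational_baseline(df: pd.DataFrame) -> float:
    # Fixed strategy: always go with whichever of human-alone or AI-alone
    # has the higher expected score over the whole experiment.
    return max(df["human_correct"].mean(), df["ai_correct"].mean())

def rational_benchmark(df: pd.DataFrame) -> float:
    # Best response per signal: conditional on each distinct signal, follow
    # whichever prediction is more often correct, then average over the
    # empirical distribution of signals.
    cond = df.groupby("signal_id")[["human_correct", "ai_correct"]].mean()
    weights = df.groupby("signal_id").size() / len(df)
    return float((cond.max(axis=1) * weights).sum())
```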

[Figure: human alone, then AI alone, then (with ample room between them) the rational benchmark]

If designing or interpreting an experiment on AI reliance, the first thing we might want to do is look at how close to the benchmark the baseline is. We want to see a decent amount of room for the human-AI team to improve performance over the baseline, as in the image above. If the baseline is very close to the benchmark, it’s probably not worth adding the human.

Once we have run an experiment and observed how well people make these decisions, we can treat the value of complementation as a comparative unit for interpreting how much value adding the human contributes over making the decision with the baseline. We do this by normalizing the observed score within the range where the rational agent baseline is 0 and the rational agent benchmark is 1 and looking at where the observed human+AI performance lies. This also provides a useful sense of effect size when we are comparing different settings. For example, if we have two model explanation strategies A and B we compared in an experiment, we can calculate expected human performance on a randomly drawn decision trial under A and under B and measure the improvement by calculating (score_A − score_B)/value of complementation. 
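
Here is a minimal sketch of that normalization and effect-size calculation (the function names and the numbers in the usage example are made up purely for illustration):

```python
# Place an observed human+AI score on a scale where the rational agent
# baseline is 0 and the rational agent benchmark is 1.
def normalized_performance(observed, baseline, benchmark):
    return (observed - baseline) / (benchmark - baseline)

# Improvement of condition A over condition B, in units of the
# value of rational complementation (benchmark minus baseline).
def effect_size(score_a, score_b, baseline, benchmark):
    return (score_a - score_b) / (benchmark - baseline)

# Hypothetical example: baseline 0.70, benchmark 0.90, observed
# human+AI scores of 0.78 under A and 0.74 under B.
print(normalized_performance(0.78, 0.70, 0.90))  # 0.4
print(effect_size(0.78, 0.74, 0.70, 0.90))       # 0.2
```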

[Figure: human alone, then AI alone, then human+AI, then the rational benchmark]

We can also decompose sources of error in study participants’ performance. To do this, we define a “mis-reliant” rational decision-maker benchmark, which is the expected score of a rational agent constrained to the reliance level that we observe in study participants. Hence this is the best score a decision-maker who relies on the AI the same overall proportion of the time could attain had they perfectly perceived the probability that the AI is correct relative to the probability that the human is correct on every decision task. Since the mis-reliant benchmark and the study participants have the same reliance level (i.e., they both accept the AI’s prediction the same percentage of the time), the difference in their decisions lies entirely in accepting the AI predictions at different instances. The mis-reliant rational decision-maker always accepts the top X% AI predictions ranked by performance advantage over human predictions, but study participants may not.

[Figure: human alone, AI alone, human+AI, the mis-reliant rational benchmark, and the rational benchmark]

By calculating the mis-reliant rational benchmark for the observed reliance level of study participants, we can distinguish between reliance loss, the loss from over- or under-relying on the AI (defined as the difference between the rational benchmark and mis-reliant benchmark divided by the value of rational complementation), and discrimination loss, the loss from not accurately differentiating the instances where the AI is better than the human from the ones where the human is better than the AI (defined as the difference between the mis-reliant benchmark and the expected score of participants divided by the value of rational complementation). 
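
A rough sketch of that decomposition, again with assumed names and with per-trial conditional probabilities of correctness estimated from the empirical distribution as above:

```python
# Mis-reliant rational benchmark: best expected score attainable by an agent
# who must accept the AI on exactly the observed fraction `reliance` of trials,
# choosing *which* trials optimally by the AI's conditional advantage.
import numpy as np

def misreliant_benchmark(p_ai, p_human, reliance):
    p_ai, p_human = np.asarray(p_ai, float), np.asarray(p_human, float)
    advantage = p_ai - p_human
    k = int(round(reliance * len(advantage)))
    take_ai = np.zeros(len(advantage), dtype=bool)
    take_ai[np.argsort(-advantage)[:k]] = True   # accept the AI on the top-k trials by advantage
    return float(np.where(take_ai, p_ai, p_human).mean())

def decompose_losses(benchmark, misreliant, observed, baseline):
    value = benchmark - baseline                 # value of rational complementation
    reliance_loss = (benchmark - misreliant) / value
    discrimination_loss = (misreliant - observed) / value
    return reliance_loss, discrimination_loss
```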

We apply this approach to some well-known studies on AI reliance and find we can extend the original interpretations to varying degrees. In one case there was little potential to observe complementarity in the study at all, given how much better the AI was than the human going in; in another, the original interpretation missed that participants’ reliance levels were pretty close to rational and they simply couldn’t distinguish the instances on which they should go with the AI. We also observe researchers making comparisons across conditions for which the upper and lower bounds on performance differ, without accounting for the difference.

There’s more in the paper – for example, we discuss how the rational benchmark, our upper bound representing the expected score of a rational decision-maker on a randomly chosen decision task, may be overfit to the empirical data. This occurs when the signal space is very large (e.g., the instance is a text document) such that we observe very few human predictions per signal. We describe how the rational agent could determine the best response on the optimal coarsening of the empirical distribution, such that the true rational benchmark is bounded by this and the overfit upper bound. 

While we focused on showing how to improve research on human-AI teams, I’m excited about the potential for this framework to help organizations as they consider whether deploying an AI is likely to improve some human decision process. We are currently thinking about what sorts of practical questions (beyond Could pairing a human and AI be effective here?) we can answer using such a framework.

Mindlessness in the interpretation of a study on mindlessness (and why you shouldn’t use the word “whom” in your dating profile)

This is a long post, so let me give you the tl;dr right away: Don’t use the word “whom” in your dating profile.

OK, now for the story. Fasten your seat belts, it’s going to be a bumpy night.

It all started with this message from Dmitri with subject line, “Man I hate to do this to you but …”, which continued:

How could I resist?

https://www.cnbc.com/2024/02/15/using-this-word-can-make-you-more-influential-harvard-study.html

I’m sorry, let me try again … I had to send this to you BECAUSE this is the kind of obvious shit you like to write about. I like how they didn’t even do their own crappy study they just resurrected one from the distant past.

OK, ok, you don’t need to shout about it!

Following the link we see this breathless, press-release-style CNBC news story:

Using this 1 word more often can make you 50% more influential, says Harvard study

Sometimes, it takes a single word — like “because” — to change someone’s mind.

That’s according to Jonah Berger, a marketing professor at the Wharton School of the University of Pennsylvania who’s compiled a list of “magic words” that can change the way you communicate. Using the word “because” while trying to convince someone to do something has a compelling result, he tells CNBC Make It: More people will listen to you, and do what you want.

Berger points to a nearly 50-year-old study from Harvard University, wherein researchers sat in a university library and waited for someone to use the copy machine. Then, they walked up and asked to cut in front of the unknowing participant.

They phrased their request in three different ways:

“May I use the Xerox machine?”
“May I use the Xerox machine because I have to make copies?”
“May I use the Xerox machine because I’m in a rush?”
Both requests using “because” made the people already making copies more than 50% more likely to comply, researchers found. Even the second phrasing — which could be reinterpreted as “May I step in front of you to do the same exact thing you’re doing?” — was effective, because it indicated that the stranger asking for a favor was at least being considerate about it, the study suggested.

“Persuasion wasn’t driven by the reason itself,” Berger wrote in a book on the topic, “Magic Words,” which published last year. “It was driven by the power of the word.” . . .

Let’s look into this claim. The first thing I did was click to the study—full credit to CNBC Make It for providing the link—and here’s the data summary from the experiment:

If you look carefully and do some simple calculations, you’ll see that the percentage of participants who complied was 37.5% under treatment 1, 50% under treatment 2, and 62.5% under treatment 3. So, ok, not literally true that both requests using “because” made the people already making copies more than 50% more likely to comply: 0.50/0.375 = 1.33, and an increase of 33% is not “more than 50%.” But, sure, it’s a positive result. There were 40 participants in each treatment, so the standard error is approximately 0.5/sqrt(40) = 0.08 for each of those averages. The key difference here is 0.50 – 0.375 = 0.125, that’s the difference between the compliance rates under the treatments “May I use the Xerox machine?” and “May I use the Xerox machine because I have to make copies?”, and this will have a standard error of approximately sqrt(2)*0.08 = 0.11.

The quick summary from this experiment: an observed difference in compliance rates of 12.5 percentage points, with a standard error of 11 percentage points. I don’t want to say “not statistically significant,” so let me just say that the estimate is highly uncertain, so I have no real reason to believe it will replicate.

But wait, you say: the paper was published. Presumably it has a statistically significant p-value somewhere, no? The answer is, yes, they have some “p < .05” results, just not of that particular comparison. Indeed, if you just look at the top rows of that table (Favor = small), then the difference is 0.93 – 0.60 = 0.33 with a standard error of sqrt(0.6*0.4/15 + 0.93*0.07/15) = 0.14, so that particular estimate is just more than two standard errors away from zero. Whew! But now we’re getting into forking paths territory:

- Noisy data
- Small sample
- Lots of possible comparisons
- Any comparison that’s statistically significant will necessarily be huge
- Open-ended theoretical structure that could explain just about any result.

I’m not saying the researchers were trying to do anything wrong. But remember, honesty and transparency are not enuf. Such a study is just too noisy to be useful.
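
If you want to check this arithmetic yourself, here’s a quick back-of-the-envelope version (a sketch assuming the usual binomial standard error formula and the cell sizes stated above):

```python
# Back-of-the-envelope check of the standard errors discussed above.
import math

def se(p, n):
    # Binomial standard error of a proportion.
    return math.sqrt(p * (1 - p) / n)

# Overall comparison: 37.5% vs. 50% compliance, 40 people per condition.
diff = 0.50 - 0.375                                      # 0.125
se_diff = math.sqrt(se(0.375, 40)**2 + se(0.50, 40)**2)  # ~0.11

# Small-favor rows only: 60% vs. 93%, about 15 people per cell.
se_small = math.sqrt(se(0.60, 15)**2 + se(0.93, 15)**2)  # ~0.14

print(round(diff, 3), round(se_diff, 3), round(se_small, 3))
```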

But, sure, back in the 1970s many psychology researchers not named Meehl weren’t aware of these issues. They seem to have been under the impression that if you gather some data and find something statistically significant for which you could come up with a good story, that you’d discovered a general truth.

What’s less excusable is a journalist writing this in the year 2024. But it’s no surprise, conditional on the headline, “Using this 1 word more often can make you 50% more influential, says Harvard study.”

But what about that book by the University of Pennsylvania marketing professor? I searched online, and, fortunately for us, the bit about the Xerox machine is right there in the first chapter, in the excerpt we can read for free. Here it is:

He got it wrong, just like the journalist did! It’s not true that including the meaningless reason increased persuasion just as much as the valid reason did. Look at the data! The outcomes under the three treatments were 37.5%, 50%, and 62.5%. 50% – 37.5% ≠ 62.5% – 37.5%. Ummm, ok, he could’ve said something like, “Among a selected subset of the data with only 15 or 16 people in each treatment, including the meaningless reason increased persuasion just as much as the valid reason did.” But that doesn’t sound so impressive! Even if you add something like, “and it’s possible to come up with a plausible theory to go with this result.”

The book continues:

Given the flaws in the description of the copier study, I’m skeptical about these other claims.

But let me say this. If it is indeed true that using the word “whom” in online dating profiles makes you 31% more likely to get a date, then my advice is . . . don’t use the word “whom”! Think of it from a potential-outcomes perspective. Sure, you want to get a date. But do you really want to go on a date with someone who will only go out with you if you use the word “whom”?? That sounds like a really pretentious person, not a fun date at all!

OK, I haven’t read the rest of the book, and it’s possible that somewhere later on the author says something like, “OK, I was exaggerating a bit on page 4 . . .” I doubt it, but I guess it’s possible.

Replications, anyone?

To return to the topic at hand: In 1978 a study was conducted with 120 participants in a single location. The study was memorable enough to be featured in a business book nearly fifty years later.

Surely the finding has been replicated?

I’d imagine yes; on the other hand, if it had been replicated, this would’ve been mentioned in the book, right? So it’s hard to know.

I did a search, and the article does seem to have been influential:

It’s been cited 1514 times—that’s a lot! Google lists 55 citations in 2023 alone, and in what seem to be legit journals: Human Communication Research, Proceedings of the ACM, Journal of Retailing, Journal of Organizational Behavior, Journal of Applied Psychology, Human Resources Management Review, etc. Not core science journals, exactly, but actual applied fields, with unskeptical mentions such as:

What about replications? I searched on *langer blank chanowitz 1978 replication* and found this paper by Folkes (1985), which reports:

Four studies examined whether verbal behavior is mindful (cognitive) or mindless (automatic). All studies used the experimental paradigm developed by E. J. Langer et al. In Studies 1–3, experimenters approached Ss at copying machines and asked to use it first. Their requests varied in the amount and kind of information given. Study 1 (82 Ss) found less compliance when experimenters gave a controllable reason (“… because I don’t want to wait”) than an uncontrollable reason (“… because I feel really sick”). In Studies 2 and 3 (42 and 96 Ss, respectively) requests for controllable reasons elicited less compliance than requests used in the Langer et al study. Neither study replicated the results of Langer et al. Furthermore, the controllable condition’s lower compliance supports a cognitive approach to social interaction. In Study 4, 69 undergraduates were given instructions intended to increase cognitive processing of the requests, and the pattern of compliance indicated in-depth processing of the request. Results provide evidence for cognitive processing rather than mindlessness in social interaction.

So this study concludes that the result didn’t replicate at all! On the other hand, it’s only a “partial replication,” and indeed they do not use the same conditions and wording as in the original 1978 paper. I don’t know why not, except maybe that exact replications traditionally get no respect.

Langer et al. responded in that journal, writing:

We see nothing in her results [Folkes (1985)] that would lead us to change our position: People are sometimes mindful and sometimes not.

Here they’re referring to the table from the 1978 study, reproduced at the top of this post, which shows a large effect of the “because I have to make copies” treatment under the “Small Favor” condition but no effect under the “Large Favor” condition. Again, given the huge standard errors here, we can’t take any of this seriously, but if you just look at the percentages without considering the uncertainty, then, sure, that’s what they found. Thus, in their response to the partial replication study that did not reproduce their results, Langer et al. emphasized that their original finding was not a main effect but an interaction: “People are sometimes mindful and sometimes not.”

That’s fine. Psychology studies often measure interactions, as they should: the world is a highly variable place.

But, in that case, everyone’s been misinterpreting that 1978 paper! When I say “everybody,” I mean this recent book by the business school professor and also the continuing references to the paper in the recent literature.

Here’s the deal. The message that everyone seems to have learned, or believed they learned, from the 1978 paper is that meaningless explanations are as good as meaningful explanations. But, according to the authors of that paper when they responded to criticism in 1985, the true message is that this trick works sometimes and sometimes not. That’s a much weaker message.

Indeed the study at hand is too small to draw any reliable conclusions about any possible interaction here. The most direct estimate of the interaction effect from the above table is (0.93 – 0.60) – (0.24 – 0.24) = 0.33, with a standard error of sqrt(0.93*0.07/15 + 0.60*0.40/15 + 0.24*0.76/25 + 0.24*0.76/25) = 0.19. So, no, I don’t see much support for the claim in this post from Psychology Today:

So what does this all mean? When the stakes are low people will engage in automatic behavior. If your request is small, follow your request with the word “because” and give a reason—any reason. If the stakes are high, then there could be more resistance, but still not too much.

This happens a lot in unreplicable or unreplicated studies: a result is found under some narrow conditions, and then it is taken to have very general implications. This is just an unusual case where the authors themselves pointed out the issue. As they wrote in their 1985 article:

The larger concern is to understand how mindlessness works, determine its consequences, and specify better the conditions under which it is and is not likely to occur.

That’s a long way from the claim in that business book that “because” is a “magic word.”

Like a lot of magic, it only works under some conditions, and you can’t necessarily specify those conditions ahead of time. It works when it works.

There might be other replication studies of this copy machine study. I guess you couldn’t really do it now, because people don’t spend much time waiting at the copier. But the office copier was a thing for several decades. So maybe there are even some exact replications out there.

In searching for a replication, I did come across this post from 2009 by Mark Liberman that criticized yet another hyping of that 1978 study, this time from a paper by psychologist Daniel Kahneman in the American Economic Review. Kahneman wrote:

Ellen J. Langer et al. (1978) provided a well-known example of what she called “mindless behavior.” In her experiment, a confederate tried to cut in line at a copying machine, using various preset “excuses.” The conclusion was that statements that had the form of an unqualified request were rejected (e.g., “Excuse me, may I use the Xerox machine?”), but almost any statement that had the general form of an explanation was accepted, including “Excuse me, may I use the Xerox machine because I want to make copies?” The superficiality is striking.

As Liberman writes, this represented a “misunderstanding of the 1978 paper’s results, involving both a different conclusion and a strikingly overgeneralized picture of the observed effects.” Liberman performs an analysis of the data from that study which is similar to what I have done above.

Liberman summarizes:

The problem with Prof. Kahneman’s interpretation is not that he took the experiment at face value, ignoring possible flaws of design or interpretation. The problem is that he took a difference in the distribution of behaviors between one group of people and another, and turned it into generic statements about the behavior of people in specified circumstances, as if the behavior were uniform and invariant. The resulting generic statements make strikingly incorrect predictions even about the results of the experiment in question, much less about life in general.

Mindfulness

The key claim of all this research is that people are often mindless: they respond to the form of a request without paying attention to its context, with “because” acting as a “magic word.”

I would argue that this is exactly the sort of mindless behavior being exhibited by the people who are promoting that copying-machine experiment! They are taking various surface aspects of the study and using them to draw large, unsupported conclusions, without being mindful of the details.

In this case, the “magic words” are things like “p < .05,” “randomized experiment,” “Harvard,” “peer review,” and “Journal of Personality and Social Psychology” (this notwithstanding). The mindlessness comes from not looking into what exactly was in the paper being cited.

In conclusion . . .

So, yeah, thanks for nothing, Dmitri! Three hours of my life spent going down a rabbit hole. But, hey, if any readers who are single have read far enough down in the post to see my advice not to use “whom” in your dating profile, it will all have been worth it.

Seriously, though, the “mindlessness” aspect of this story is interesting. The point here is not, Hey, a 50-year-old paper has some flaws! Or the no-less-surprising observation: Hey, a pop business book exaggerates! The part that fascinates me is that there’s all this shaky research that’s being taken as strong evidence that consumers are mindless—and the people hyping these claims are themselves demonstrating the point by mindlessly following signals without looking into the evidence.

The ultimate advice that the mindfulness gurus are giving is not necessarily so bad. For example, here’s the conclusion of that online article about the business book:

Listen to the specific words other people use, and craft a response that speaks their language. Doing so can help drive an agreement, solution or connection.

“Everything in language we might use over email at the office … [can] provide insight into who they are and what they’re going to do in the future,” says Berger.

That sounds ok. Just forget all the blather about the “magic words” and the “superpowers,” and forget the unsupported and implausible claim that “Arguments, requests and presentations aren’t any more or less convincing when they’re based on solid ideas.” As often is the case, I think these Ted-talk style recommendations would be on more solid ground if they were just presented as the product of common sense and accumulated wisdom, rather than leaning on some 50-year-old psychology study that just can’t bear the weight. But maybe you can’t get the airport book and the Ted talk without a claim of scientific backing.

Don’t get me wrong here. I’m not attributing any malign motivations to any of the people involved in this story (except for Dmitri, I guess). I’m guessing they really believe all this. And I’m not using “mindless” as an insult. We’re all mindless sometimes—that’s the point of the Langer et al. (1978) study; it’s what Herbert Simon called “bounded rationality.” The trick is to recognize your areas of mindlessness. If you come to an area where you’re being mindless, don’t write a book about it! Even if you naively think you’ve discovered a new continent. As Mark Twain apparently never said, it ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.

The usual disclaimer

I’m not saying the claims made by Langer et al. (1978) are wrong. Maybe it’s true that, under conditions of mindlessness, all that matters is the “because” and any empty explanation will do; maybe the same results would show up in a preregistered replication. All I’m saying is that the noisy data that have been presented don’t provide any strong evidence in support of such claims, and that’s what bothers me about all those confident citations in the business literature.

P.S.

After writing the above post, I sent this response to Dmitri:

OK, I just spent 3 hours on this. I now have to figure out what to do with this after blogging it, because I think there are some important points here. Still, yeah, you did a bad thing by sending this to me. These are 3 hours I could’ve spent doing real work, or relaxing . . .

He replied:

I mean, yeah, that’s too bad for you, obviously. But … try to think about it from my point of view. I am more influential, I got you to work on this while I had a nice relaxing post-Valentine’s day sushi meal with my wife (much easier to get reservations on the 15th and the flowers are a lot cheaper), while you were toiling away on what is essentially my project. I’d say the magic words did their job.

Good point! He exploited my mindlessness. I responded:

Ok, I’ll quote you on that one too! (minus the V-day details).

I’m still chewing on your comment that you appreciate the Beatles for their innovation as much as for their songs. The idea that there are lots of songs of similar quality but not so much innovation, that’s interesting. The only thing is that I don’t know enough about music, even pop music, to have a mental map of where everything fits in. For example, I recently heard that Coldplay song, and it struck me that it was in the style of U2. But I don’t really know if U2 was the originator of that soaring sound. I guess Pink Floyd is kinda soaring too, but not quite in the same way . . . etc etc … the whole thing was frustrating to me because I had no sense of whether I was entirely bullshitting or not.

So if you can spend 3 hours writing a post on the above topic, we’ll be even.

Dmitri replied:

I am proud of the whole “Valentine’s day on the 15th” trick, so you are welcome to include it. That’s one of our great innovations. After the first 15-20 Valentine’s days, you can just move the date a day later and it is much easier.

And, regarding the music, he wrote:

U2 definitely invented a sound, with the help of their producer Brian Eno.

It is a pretty safe bet that every truly successful musician is an innovator—once you know the sound it is easy enough to emulate. Beethoven, Charlie Parker, the Beatles, all the really important guys invented a forceful, effective new way of thinking about music.

U2 is great, but when I listened to an entire U2 song from beginning to end, it seemed so repetitive as to be unlistenable. I don’t feel that way about the Beatles or REM. But just about any music sounds better to me in the background, which I think is a sign of my musical ignorance and tone-deafness (for real, I’m bad at recognizing pitches) more than anything else. I guess the point is that you’re supposed to dance to it, not just sit there and listen.

Anyway, I warned Dmitri about what would happen if I post his Valentine’s Day trick:

I post this, then it will catch on, and it will no longer work . . . just warning ya! You’ll have to start doing Valentine’s Day on the 16th, then the 17th, . . .

To which Dmitri responded:

Yeah but if we stick with it, it will roll around and we will get back to February 14 while everyone else is celebrating Valentines Day on these weird wrong days!

I’ll leave him with the last word.

When do we expect conformal prediction sets to be helpful? 

This is Jessica. Over on substack, Ben Recht has been posing some questions about the value of prediction bands with marginal guarantees, such as one gets from conformal prediction. It’s an interesting discussion that caught my attention since I have also been musing about where conformal prediction might be helpful. 

To briefly review, given a training data set (X1, Y1), … ,(Xn, Yn), and a test point (Xn+1, Yn+1) drawn from the same distribution, conformal prediction returns a subset of the label space for which we can make coverage guarantees about the probability of containing the test point’s true label Yn+1. A prediction set Cn achieves distribution-free marginal coverage at level 1 − alpha when P(Yn+1 ∈ Cn(Xn+1)) >= 1 − alpha for all joint distributions P on (X, Y). The commonly used split conformal prediction process attains this by adding a couple of steps to the typical modeling workflow: you first split the data into a training and calibration set, fitting the model on the training set. You choose a heuristic notion of uncertainty from the trained model, such as the softmax values–pseudo-probabilities from the last layer of a neural network–to create a score function s(x,y) that encodes disagreement between x and y (in a regression setting these are just the residuals). You compute q_hat, the ((n+1)(1-alpha))/n quantile of the scores on the calibration set. Then given a new instance x_n+1, you construct a prediction set for y_n+1 by including all y’s for which the score is less than or equal to q_hat. There are various ways to achieve slightly better performance, such as using cumulative summed scores and regularization instead.
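
To make that recipe concrete, here is a minimal sketch of split conformal prediction for classification, assuming the trained model exposes softmax scores through a predict_proba-style method and that labels are integer-coded; the function and variable names are mine, not any particular library’s API:

```python
# Minimal sketch of split conformal prediction for classification,
# following the recipe described above.
import numpy as np

def split_conformal_sets(model, X_cal, y_cal, X_new, alpha=0.05):
    n = len(y_cal)
    # Score function: 1 - softmax probability assigned to the true label
    # (higher score = more disagreement between x and y).
    cal_probs = model.predict_proba(X_cal)            # shape (n, n_classes)
    cal_scores = 1.0 - cal_probs[np.arange(n), y_cal]
    # q_hat: the ((n+1)(1-alpha))/n empirical quantile of the calibration scores
    # (capped at 1 as a guard for very small calibration sets).
    q_level = min((n + 1) * (1 - alpha) / n, 1.0)
    q_hat = np.quantile(cal_scores, q_level, method="higher")
    # Prediction set for each new instance: all labels whose score is <= q_hat.
    new_probs = model.predict_proba(X_new)
    return [np.where(1.0 - p <= q_hat)[0] for p in new_probs]
```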

Recht makes several good points about limitations of conformal prediction, including:

—The marginal coverage guarantees are often not very useful. Instead we want conditional coverage guarantees that hold conditional on the value of Xn+1 we observe. But you can’t get true conditional coverage guarantees (i.e., P(Yn+1 ∈ Cn(Xn+1)|Xn+1 = x) >= 1 − alpha for all P and almost all x) if you also want the approach to be distribution free (see e.g., here), and in general you need a very large calibration set to be able to say with high confidence that there is a high probability that your specific interval contains the true Yn+1.

—When we talk about needing prediction bands for decisions, we are often talking about scenarios where the decisions we want to make from the uncertainty quantification are going to change the distribution and violate the exchangeability criterion. 

—Additionally, in many of the settings where we might imagine using prediction sets there is potential for recourse. If the prediction is bad, resulting in a bad action being chosen, the action can be corrected, i.e., “If you have multiple stages of recourse, it almost doesn’t matter if your prediction bands were correct. What matters is whether you can do something when your predictions are wrong.”

Recht also criticizes research on conformal prediction as being fixated on the ability to make guarantees, irrespective of how useful the resulting intervals are. E.g., we can produce sets with 95% coverage with only two points, and the guarantees are always about coverage instead of the width of the interval.

These are valid points, worth discussing given how much attention conformal prediction has gotten lately. Some of the concerns remind me of the same complaints we often hear about traditional confidence intervals we put on parameter estimates, where the guarantees we get (about the method) are also generally not what we want (about the interval itself) and only actually summarize our uncertainty when the assumptions we made in inference are all good, which we usually can’t verify. A conformal prediction interval is about uncertainty in a model’s prediction on a specific instance, which perhaps makes it more misleading to some people given that it might not be conditional on anything specific to the instance. Still, even if the guarantees don’t stand as stated, I think it’s difficult to rule out an approach without evaluating how it gets used. Given that no method ever really quantifies all of our uncertainty, or even all of the important sources of uncertainty, the “meaning” of an uncertainty quantification really depends on its use, and what the alternatives considered in a given situation are. So I guess I disagree that one can answer the question “Can conformal prediction achieve the uncertainty quantification we need for decision-making?” without considering the specific decision at hand, how we are constructing the prediction set exactly (since there are ways to condition the guarantees on some instance-specific information), and how it would be made without a prediction set. 

There are various scenarios where prediction sets are used without a human in the loop, like to get better predictions or directly calibrate decisions, where it seems hard to argue that it’s not adding value over not incorporating any uncertainty quantification. Conformal prediction for alignment purposes (e.g., control the factuality or toxicity of LLM outputs) seems to be on the rise. However I want to focus here on a scenario where we are directly presenting a human with the sets. One type of setting where I’m curious whether conformal prediction sets could be useful are those where 1) models are trained offline and used to inform people’s decisions, and 2) it’s hard to rigorously quantify the uncertainty in the predictions using anything the model produces internally, like softmax values which can be overfit to the training sample.

For example, a doctor needs to diagnose a skin condition and has access to a deep neural net trained on images of skin conditions for which the illness has been confirmed. Even if this model appears to be more accurate than the doctor on evaluation data, the hospital may not be comfortable deploying the model in place of the doctor. Maybe the doctor has access to additional patient information that may in some cases allow them to make a better prediction, e.g., because they can decide when to seek more information through further interaction or monitoring of the patient. This means the distribution does change upon acting on the prediction, and I think Recht would say there is potential for recourse here, since the doctor can revise the treatment plan over time (he lists a similar example here). But still, at any given point in time, there’s a model and there’s a decision to be made by a human.    

Giving the doctor information about the model’s confidence in its prediction seems like it should be useful in helping them appraise the prediction in light of their own knowledge. Similarly, giving them a prediction set over a single top-1 prediction seems potentially preferable so they don’t anchor too heavily on a single prediction. Deep neural nets for medical diagnoses can do better than many humans in certain domains while still having relatively low top-1 accuracy (e.g., here). 

A naive thing to do would be to just choose some number k of predictions from the model we think a doctor can handle seeing at once, and show the top-k with softmax scores. But an adaptive conformal prediction set seems like an improvement in that at least you get some kind of guarantee, even if it’s not specific to your interval like you want. Set size conveys information about the level of uncertainty like the width of a traditional confidence interval does, which seems more likely to be helpful for conveying relative uncertainty than holding set size constant and letting the coverage guarantee change (I’ve heard from at least one colleague who works extensively with doctors that many are pretty comfortable with confidence intervals). We can also take steps toward the conditional coverage that we actually want by using an algorithm that calibrates the guarantees over different classes (labels), or that achieves a relaxed version of conditional coverage, possibilities that Recht seems to overlook. 

So while I agree with all the limitations, I don’t necessarily agree with Recht’s concluding sentence here:

“If you have multiple stages of recourse, it almost doesn’t matter if your prediction bands were correct. What matters is whether you can do something when your predictions are wrong. If you can, point predictions coupled with subsequent action are enough to achieve nearly optimal decisions.” 

It seems possible that seeing a prediction set (rather than just a single top prediction) will encourage a doctor to consider other diagnoses that they may not have thought of. Presenting uncertainty often has _some_ effect on a person’s reasoning process, even if they can revise their behavior later. The effect of seeing more alternatives could be bad in some cases (they get distracted by labels that don’t apply), or it could be good (a hurried doctor recognizes a potentially relevant condition they might have otherwise overlooked). If we allow for the possibility that seeing a set of alternatives helps, it makes sense to have a way to generate them that give us some kind of coverage guarantee we can make sense of, even if it gets violated sometimes. 

This doesn’t mean I’m not skeptical of how much prediction sets might change things over more naively constructed sets of possible labels. I’ve spent a bit of time thinking about how, from the human perspective, prediction sets could or could not add value, and I suspect it’s going to be nuanced, and the real value probably depends on how the coverage responds under realistic changes in distribution. There are lots of questions that seem worth trying to answer in particular domains where models are being deployed to assist decisions. Does it actually matter in practice, such as in a given medical decision setting, for the quality of decisions that are made if the decision-makers are given a set of predictions with coverage guarantees versus a top-k display without any guarantees? And, what happens when you give someone a prediction set with some guarantee but there are distribution shifts such that the guarantees you give are not quite right? Are they still better off with the prediction set or is this worse than just providing the model’s top prediction or top-k with no guarantees? Again, many of the questions could also be asked of other uncertainty quantification approaches; conformal prediction is just easier to implement in many cases. I have more to say on some of these questions based on a recent study we did on decisions from prediction sets, where we compared how accurately people labeled images using them versus other displays of model predictions, but I’ll save that for another post since this is already long.

Of course, it’s possible that in many settings we would be better off using some inherently interpretable model for which we no longer need a distribution-free approach. And ultimately we might be better off if we can better understand the decision problem the human decision-maker faces and apply decision theory to try to find better strategies, rather than leaving it up to the human how to combine their knowledge with what they get from a model prediction. I think we still barely understand how this occurs even in the high-stakes settings that people often talk about.

Uncertainty in games: How to get that balance so that there’s a motivation to play well, but you can still have a chance to come back from behind?

I just read the short book, “Uncertainty in games,” by Greg Costikyan. It was interesting. His main point, which makes sense to me, is that uncertainty is a key part of the appeal of any game. He gives interesting examples of different sources of uncertainty. For example, if you’re playing a video game such as Pong, the uncertainty is in your own reflexes and reactions. With Diplomacy, there’s uncertainty in what the other players will do. With poker, there’s uncertainty about all the hole cards. With chess, there’s uncertainty in what the other player will do and also uncertainty in the logical implications of any position, in the same way that I am uncertain about what is the 200th digit of the decimal expansion of pi, even though that number exists. I agree with Costikyan that uncertainty is a helpful concept for thinking about games.

There’s one thing he didn’t discuss in his book, though, that I wanted to hear more about, and that’s the way that time and uncertainty interact in games, and how this factors into game design. I’ve been thinking a lot about time lately, and this is another example, especially relevant to me as we’re in the process of finishing up the design of a board game, and we want to improve its playability.

To fix ideas, consider a multi-player tabletop game with a single winner, and suppose the game takes somewhere between a half hour and two hours to play. As a player, I want to have a real chance of winning, until close to the end, and when the game reaches the point at which I pretty much know I can’t lose, I still want it to be fun, I want some intermediate goal such as the possibility of being a spoiler, or of being able to capitalize on my opponents’ mistakes. At the same time, I don’t want the outcome to be entirely random.

Consider two extremes:
1. One player gets ahead early and then can relentlessly exploit the advantage to get a certain win.
2. Nobody is ever ahead by much; there’s a very equal balance, and the winner is decided only at the very end by some random event.

Option #1 actually isn’t so bad—as long as the player in the lead can compound the advantage and force the win quickly. For example, in chess, if you have a decisive lead you can use your pieces together to increase your advantage. This is to be distinguished from how we played as kids, which was that once you’re in the lead you’d just try to trade pieces until the opposing player had nothing left: that got pretty boring. If you can use your pieces together, the game is more interesting even during the period where the winning player is clinching it.

Option #2 would not be so much fun. Sure, sometimes you will have a close game that’s decided at the very end, and that’s fine, but I’d like for victory to be some reflection of cumulative game play, as otherwise it’s meaningless.

Sometimes this isn’t so important. In Scrabble, for example, the play itself is enjoyable. The competition can also be good—it’s fun to be in a tight game where you’re counting the letters, blocking out the other player, and strategizing to get that final word on the board—but even if you’re way behind, you can still try to get the most out of your rack.

In some other games, though, once you’re behind and you don’t have a chance to win, it’s just a chore to keep playing. Monopoly and Risk handle this by creating a positive incentive for players to wipe out weak opponents, so that once you’re down, you’ll soon be out.

And yet another approach is to have cumulative scoring. In poker it’s all about the money. Whether you’re ahead or behind for the night, you’re still motivated to improve your bankroll.

One thing I don’t have a good grip on regarding game design is how to get that balance between all these possibilities, so that how you play matters throughout the game, while at the same time keeping the possibility of winning open for as long as possible.

I remember my dad saying that he preferred tennis scoring (each game is played to 4 points, each set to 6 games, and the match goes to whoever wins 2 out of 3 or 3 out of 5 sets) as compared to old-style ping-pong scoring (whoever reaches 21 points first, wins), because in tennis, even if you’re way behind, you always have a chance to come back. Which makes sense, and is related to Costikyan’s point about uncertainty, but is hard for me to formalize.

A key idea here, I think, is that the relative skill of the players during the course of a match is a nonstationary process. For example, if player A is winning, perhaps up 2 sets to 0 and up 5 games to 2 in the third set, but then player B comes from behind to catch up and then maybe win in the fifth set, yes, this is an instance of uncertainty in action, but it won’t be happening at random. What will happen is that A gets tired, or B figures out a new plan of action, or some other factor that affects the relative balance of skill. And that itself is part of the game.

In summary, we’d like the game to balance three aspects:

1. Some positive feedback mechanism so that when you’re ahead you can use this advantage to increase your lead.

2. Some responsiveness to changes in effort and skill during the game, so that by pushing really hard or coming up with a clever new strategy you can come back from behind.

3. Uncertainty, as emphasized by Costikyan.

I’m sure that game designers have thought systematically about such things; I just don’t know where to look.

Clinical trials that are designed to fail

Mark Palko points us to a recent update by Robert Yeh et al. of the famous randomized parachute-jumping trial:

Palko writes:

I also love the way they dot all the i’s and cross all the t’s. The whole thing is played absolutely straight.

I recently came across another (not meant as satire) study where the raw data was complete crap but the authors had this ridiculously detailed methods section, as if throwing in a graduate level stats course worth of terminology would somehow spin this shitty straw into gold.

Yeh et al. conclude:

This reminded me of my zombies paper. I forwarded the discussion to Kaiser Fung, who wrote:

Another recent example from Covid is this Scottish study. They did so much to the data that it is impossible for any reader to judge whether they did the right things or not. The data are all locked down for “privacy.”

Getting back to the original topic, Joseph Delaney had some thoughts:

I think the parachute study makes a good and widely misunderstood point. Our randomized controlled trial infrastructure is designed for the drug development world, where there is a huge (literally life altering) benefit to proving the efficacy of a new agent. Conservative errors are being cautious, and nobody seriously considers a trial designed to fail as a plausible scenario.

But you see new issues with trials designed to find side effects (e.g., RECORD has a lot more LTFU [loss to follow-up] than I saw in a drug study; when I did trials we studied how to improve adherence to improve the results—but a trial looking for side effects that cost the company money would do the reverse). We teach in pharmacy that conservative design is actually a problem in safety trials.

Even worse are trials which are aliased with a political agenda. It’s easy-peasy to design a trial to fail (the parachute trial was jumping from a height of 2 feet). That makes me a lot more critical when you see trials where the failure of the trial would be seen as an upside, because it is just so easy to botch a trial. Designing good trials is very hard (smarter people than I spend entire careers doing a handful of them). It’s a tough issue.

Lots to chew on here.

If school funding doesn’t really matter, why do people want their kid’s school to be well funded?

A question came up about the effects of school funding and student performance, and we were referred to this review article from a few years ago by Larry Hedges, Terri Pigott, Joshua Polanin, Ann Marie Ryan, Charles Tocci, and Ryan Williams:

One question posed continually over the past century of education research is to what extent school resources affect student outcomes. From the turn of the century to the present, a diverse set of actors, including politicians, physicians, and researchers from a number of disciplines, have studied whether and how money that is provided for schools translates into increased student achievement. The authors discuss the historical origins of the question of whether school resources relate to student achievement, and report the results of a meta-analysis of studies examining that relationship. They find that policymakers, researchers, and other stakeholders have addressed this question using diverse strategies. The way the question is asked, and the methods used to answer it, is shaped by history, as well by the scholarly, social, and political concerns of any given time. The diversity of methods has resulted in a body of literature too diverse and too inconsistent to yield reliable inferences through meta-analysis. The authors suggest that a collaborative approach addressing the question from a variety of disciplinary and practice perspectives may lead to more effective interventions to meet the needs of all students.

I haven’t followed this literature carefully. It was my vague impression that studies have found effects of schools on students’ test scores to be small. So, not clear that improving schools will do very much. On the other hand, everyone wants their kid to go to a good school. Just for example, all the people who go around saying that school funding doesn’t matter, they don’t ask to reduce the funding of their own kids’ schools. And I teach at an expensive school myself. So lots of pieces here, hard for me to put together.

I asked education statistics expert Beth Tipton what she thought, and she wrote:

I think the effect of money depends upon the educational context. For example, in higher education at selective universities, the selection process itself is what ensures success of students – the school matters far less. But in K-12, and particularly in under resourced areas, schools and finances can matter a lot – thus the focus on charter schools in urban locales.

I guess the problem here is that I’m acting like the typical uninformed consumer of research. The world is complicated, and any literature will be a mess, full of claims and counter-claims, but here I am expecting there to be a simple coherent story that I can summarize in a short sentence (“Schools matter” or “Schools don’t matter” or, maybe, “Schools matter but only a little”).

Given how frustrated I get when others come into a topic with this attitude, I guess it’s good for me to recognize when I do it.

“Replicability & Generalisability”: Applying a discount factor to cost-effectiveness estimates.

This one’s important.

Matt Lerner points us to this report by Rosie Bettle, Replicability & Generalisability: A Guide to CEA discounts.

“CEA” is cost-effectiveness analysis, and by “discounts” they mean what we’ve called the Edlin factor—“discount” is a better name than factor, because it’s a number that should be between 0 and 1, it’s what you should multiply a point estimate by to adjust for inevitable upward biases in reported effect-size estimates, issues discussed here and here, for example.

It’s pleasant to see some of my ideas being used for a practical purpose. I would just add that type M and type S errors should be lower for Bayesian inferences than for raw inferences that have not been partially pooled toward a reasonable prior model.
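To make the arithmetic concrete, here is a minimal sketch (my own illustration with made-up numbers, not anything from the Bettle report or the van Zwet papers) of one way such a discount arises: under a normal prior centered at zero, the posterior mean is the reported estimate multiplied by a shrinkage weight between 0 and 1, and the same update tells you how worried to be about the sign of the effect.

    import numpy as np
    from scipy import stats

    # Made-up numbers: a reported effect estimate with its standard error,
    # plus a skeptical normal prior for the true effect, centered at zero.
    estimate, se = 0.40, 0.20
    prior_mean, prior_sd = 0.0, 0.15

    # Normal-normal conjugate update: the posterior mean is a weighted average
    # of the prior mean and the estimate; with a zero-centered prior the weight
    # on the data acts as a multiplicative discount on the point estimate.
    w = prior_sd**2 / (prior_sd**2 + se**2)
    post_mean = w * estimate + (1 - w) * prior_mean
    post_sd = np.sqrt(w) * se

    print(f"discount (shrinkage weight): {w:.2f}")
    print(f"raw estimate {estimate:.2f} -> shrunk estimate {post_mean:.2f}")

    # Posterior probability that the true effect has the opposite sign of the
    # estimate (the type S worry).
    print(f"Pr(true effect < 0): {stats.norm.cdf(0, post_mean, post_sd):.2f}")

With these particular made-up numbers the discount comes out to 0.36; the point is just that the discount is not a free-floating fudge factor but falls out of how strong a prior you are willing to defend relative to the noise in the estimate.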

Also, regarding empirical estimation of adjustment factors, I recommend looking at the work of Erik van Zwet et al; here are some links:
– What’s a good default prior for regression coefficients? A default Edlin factor of 1/2?
– How large is the underlying coefficient? An application of the Edlin factor to that claim that “Cash Aid to Poor Mothers Increases Brain Activity in Babies”
– The Shrinkage Trilogy: How to be Bayesian when analyzing simple experiments
– Erik van Zwet explains the Shrinkage Trilogy
– The significance filter, the winner’s curse and the need to shrink
– Bayesians moving from defense to offense: “I really think it’s kind of irresponsible now not to use the information from all those thousands of medical trials that came before. Is that very radical?”
– Explaining that line, “Bayesians moving from defense to offense”

I’m excited about the application of these ideas to policy analysis.

Minimum criteria for studies evaluating human decision-making

This is Jessica. A while back on the blog I shared some opinions about studies of human decision-making, such as studies of how visualizations or displays of model predictions and explanations impact people’s behavior. My view is essentially that a lot of the experiments being used to do things like rank interfaces or model explanation techniques are not producing very informative results because the decision task is defined too loosely.

I decided to write up some thoughts rather than only blogging them. In Decision Theoretic Foundations for Human Decision Experiments (with Alex Kale and Jason Hartline), we write: 

Decision-making with information displays is a key focus of research in areas like explainable AI, human-AI teaming, and data visualization. However, what constitutes a decision problem, and what is required for an experiment to be capable of concluding that human decisions are flawed in some way, remain open to speculation. We present a widely applicable definition of a decision problem synthesized from statistical decision theory and information economics. We argue that to attribute loss in human performance to forms of bias, an experiment must provide participants with the information that a rational agent would need to identify the normative decision. We evaluate the extent to which recent evaluations of decision-making from the literature on AI-assisted decisions achieve this criteria. We find that only 6 (17%) of 35 studies that claim to identify biased behavior present participants with sufficient information to characterize their behavior as deviating from good decision-making. We motivate the value of studying well-defined decision problems by describing a characterization of performance losses they allow us to conceive. In contrast, the ambiguities of a poorly communicated decision problem preclude normative interpretation. 

We make a couple of main points. First, if you want to evaluate human decision-making from some sort of information interface, you should be able to formulate the task you are studying as a decision problem as defined by statistical decision theory and information economics. Specifically, a decision problem consists of a payoff-relevant state, a data-generating model which produces signals that induce a distribution over the state, an action space from which the decision-maker chooses a response, and a scoring rule that defines the quality of the decision as a function of the action that was chosen and the realization of the payoff-relevant state. Using this definition of a decision problem gives you a statistically coherent way to define the normative decision, i.e., the action that a Bayesian agent would choose to maximize their utility under whatever scoring rule you’ve set up. In short, if you want to say anything based on your results that implies people’s decisions are flawed, you need to make clear what is optimal, and you’re not going to do better than statistical decision theory.
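For concreteness, here is a toy instance of that formalism (a made-up example of mine, not one from the paper): a binary payoff-relevant state, a noisy signal that induces a posterior over the state, a two-action space, and a scoring rule; the normative decision is whichever action maximizes posterior expected score.

    import numpy as np

    # A toy decision problem: binary state, noisy binary signal, two actions.
    prior = np.array([0.7, 0.3])              # Pr(state = 0), Pr(state = 1)
    likelihood = np.array([[0.8, 0.2],        # Pr(signal | state = 0)
                           [0.3, 0.7]])       # Pr(signal | state = 1)
    score = np.array([[ 1.0, -2.0],           # score[action = 0, state]
                      [-0.5,  1.5]])          # score[action = 1, state]

    def normative_action(signal):
        """Posterior over the state given the signal, and the action that
        maximizes posterior expected score (the rational-agent standard)."""
        posterior = prior * likelihood[:, signal]
        posterior /= posterior.sum()
        expected_scores = score @ posterior
        return expected_scores.argmax(), posterior

    for s in (0, 1):
        action, posterior = normative_action(s)
        print(f"signal {s}: posterior {np.round(posterior, 2)}, normative action {action}")

Everything a study would need to pin down is visible here: change the scoring rule and the normative action can change even though the posterior beliefs do not, which is one reason leaving the scoring rule vague makes “flawed decision” hard to define.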

The second requirement is that you communicate to the study participants sufficient information for a rational agent to know how to optimize: select the optimal action after forming posterior beliefs about the state of the world given whatever signals (visualizations, displays of model predictions, etc.) you are showing them.

When these criteria are met you gain the ability to conceive of different sources of performance loss implied by the process that the rational Bayesian decision-maker goes through when faced with the decision problem: 

  • Prior loss, the loss in performance due to the difference between the agent’s prior beliefs and those used by the researchers to calculate the normative standard.
  • Receiver loss, the loss due to the agent not properly extracting the information from the signal, for example, because the human visual system constrains what information is actually perceived or because participants can’t figure out how to read the signal.
  • Updating loss, the loss due to the agent not updating their prior beliefs according to Bayes rule with the information they obtained from the signal (in cases where the signal does not provide sufficient information about the posterior probability on its own).
  • Optimization loss, the loss in performance due to not identifying the optimal action under the scoring rule. 

Complicating things is loss due to the possibility that the agent misunderstands the decision task, e.g., because they didn’t really internalize the scoring rule. So any hypothesis you might try to test about one of the sources of loss above is actually testing the joint hypothesis consisting of your hypothesis plus the hypothesis that participants understood the task. We don’t get into how to estimate these losses, but some of our other work does, and there’s lots more to explore there. 

If you communicate to your study participants part of a decision problem, but leave out some important component, you should expect their lack of clarity about the problem to induce heterogeneity in the behaviors they exhibit. And then you can’t distinguish such “heterogeneity by design” from real differences in decision quality across the conditions that you are trying to study. You don’t know if participants are making flawed decisions because of real challenges with forming accurate beliefs or selecting the right action under different types of signals or because they are operating under a different version of the decision problem than you have in mind.

Here’s a picture that comes to mind:

Diagram showing an underspecified decision problem being interpreted differently by different people.

I.e., each participant might have a unique way of filling in the details about the problem that you’ve failed to communicate, which differs from how you analyze it. Often I think experimenters are overly optimistic about how easy it is to move from the left side (the artificial world of the experiment) to drawing conclusions about the right. I think sometimes people believe that if they leave out some information (e.g., they don’t communicate to participants the prior probability of recidivating in a study on recidivism prediction, or they set up a fictional voting scenario but don’t give participants a clear scoring rule when studying effects of different election forecast displays), they are “being more realistic,” because in the real world people rely on their own intuitions and past experience, so there are lots of possible influences on how a person makes their decision. But, as we write in the paper, this is a mistake, because people will generally have different goals and beliefs in an experiment than they do in the real world. Even if everyone in the experiment is influenced by a different factor that does operate in the real world, the idea that the composition of all these interpretations gives us a good approximation of real-world behavior is not supported; as we say in the paper, it “arises from a failure to recognize our fundamental uncertainty about how the experimental context relates to the real world.” We can’t know for sure how good a simulacrum our experimental context is for the real-world task, so we should at least be very clear about what the experimental context is so we can draw internally valid conclusions.

Criterion 1 is often met in visualization and human-centered AI, but Criterion 2 is not

I don’t think these two criteria are met in most of the interface decision experiments I come across. In fact, as the abstract mentions, Alex and I looked at a sample of 46 papers on AI-assisted decision-making that a survey previously labeled as evaluating human decisions; of these, 11 were interested in studying tasks for which you can’t define ground truth, like emotional responses people had to recommendations, or had a descriptive purpose, like estimating how accurately a group of people can guess the post-release criminal status of a set of defendants in the COMPAS dataset. Of the remaining 35, only a handful gave participants enough information for them to at least in theory know how to best respond to the problem. And even when sufficient information to solve the decision problem in theory is given, often the authors use a different scoring rule to evaluate the results than they gave to participants. The problem here is that you are assigning a different meaning to the same responses when you evaluate versus when you instruct participants. There were also many instances of information asymmetries between conditions the researchers compared, like where some of the prediction displays contained less decision-relevant information or some of the conditions got feedback after each decision while others didn’t. Interpreting the results would be easier if the authors accounted for the difference in expected performance based on giving people a slightly different problem.
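As a toy numerical illustration of that scoring-rule mismatch (my own invented setup, not one of the reviewed studies): suppose participants are told that a miss costs three times as much as a false alarm, so the optimal policy is to act whenever the displayed probability exceeds 0.25; if the analysis then scores plain accuracy, the participant who followed the instructions looks worse than one who ignored them.

    import numpy as np

    # Instructed scoring rule: a miss costs 3 units, a false alarm costs 1,
    # so the optimal policy is to act whenever Pr(event) > 1 / (1 + 3) = 0.25.
    # Evaluation rule sometimes used in the analysis: plain accuracy, whose
    # optimal policy is instead to act whenever Pr(event) > 0.5.
    miss_cost, fa_cost = 3.0, 1.0
    t_instructed = fa_cost / (fa_cost + miss_cost)   # 0.25
    t_accuracy = 0.5

    rng = np.random.default_rng(0)
    p = rng.uniform(size=200_000)             # decision-relevant probability shown
    event = rng.uniform(size=p.size) < p      # whether the event actually occurs

    for label, t in [("follows instructed rule", t_instructed),
                     ("optimizes accuracy", t_accuracy)]:
        act = p > t
        cost = (miss_cost * (event & ~act) + fa_cost * (~event & act)).mean()
        accuracy = (act == event).mean()
        print(f"{label}: mean instructed cost {cost:.3f}, accuracy {accuracy:.3f}")

The participant who did what they were told ends up with lower cost under the instructed rule but lower accuracy, so evaluating with accuracy would mislabel the instructed-optimal behavior as a bias toward over-acting.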

In part the idea of writing this up was that it could provide a kind of explainer of the philosophy behind work we’ve done recently that defines rational agent benchmarks for different types of decision studies. As I’ve said before, I would love to see people studying interfaces adopt statistical decision theory more explicitly. However, we’ve encountered resistance in some cases. One reason, I suspect, is that people don’t understand the assumptions made in decision theory, so this is an attempt to walk through things step by step to build confidence. Though there may be other reasons too, related to people distrusting anything that claims to be “rational.”

“When will AI be able to do scientific research both cheaper and better than us, thus effectively obsoleting humans?”

Alexey Guzey asks:

How much have you thought about AI and when will AI be able to do scientific research both cheaper and better than us, thus effectively obsoleting humans?

My first reply: I guess that AI can already do better science than Matthew “Sleeplord” Walker, Brian “Pizzagate” Wansink, Marc “Schoolmarm” Hauser, or Satoshi “Freakonomics” Kanazawa. So some humans are already obsolete, when it comes to producing science.

OK, let me think a bit more. I guess it depends on what kind of scientific research we’re talking about. Lots of research can be automated, and I could easily imagine an AI that can do routine analysis of A/B tests better than a human could. Indeed, thinking of how the AI could do this is a good way to improve how humans currently do things.

For bigger-picture research, I don’t see AI doing much. But a big problem now with human research is that human researchers want to take routine research and promote it as big-picture (see Walker, Wansink, Kanazawa, etc.). I guess that an AI could be programmed to do hype and create Ted talk scripts.

Guzey’s response:

What’s “routine research”? Would someone without a college degree be able to do it? Is routine research simply defined as such that can be done by a computer now?

My reply: I guess the computer couldn’t really do the research, as that would require filling test tubes or whatever. I’m thinking that the computer could set up the parameters of an experiment, evaluate measurements, choose sample size, write up the analysis, etc. It would have to be some computer program that someone writes. If you just fed the scientific literature into a chatbot, I guess you’d just get millions more crap papers, basically reproducing much of what is bad about the literature now, which is the creation of articles that give the appearance of originality and relevance while actually being empty in content.

But, now that I’m writing this, I think Guzey is asking something slightly different: he wants to know when a general purpose “scientist” computer could be written, kind of like a Roomba or a self-driving car, but instead of driving around, it would read the literature, perform some sort of sophisticated meta-analyses, and come up with research ideas, like “Run an experiment on 500 people testing manipulations A and B, measure pre-treatment variables U and V, and look at outcomes X and Y.” I guess the first step would be to try to build such a system in a narrow environment such as testing certain compounds that are intended to kill bacteria or whatever.

I don’t know. On one hand, even the narrow version of this problem sounds really hard; on the other hand, our standards for publishable research are so low that it doesn’t seem like it would be so difficult to write a computer program that can fake it.

Maybe the most promising area of computer-designed research would be in designing new algorithms, because there the computer could actually perform the experiment; no laboratory or test tubes required, so the experiments can be run automatically and the computer could try millions of different things.

Learning from mistakes (my online talk for the American Statistical Association, 2:30pm Tues 30 Jan 2024)

Here’s the link:

Learning from mistakes

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

We learn so much from mistakes! How can we structure our workflow so that we can learn from mistakes more effectively? I will discuss a bunch of examples where I have learned from mistakes, including data problems, coding mishaps, errors in mathematics, and conceptual errors in theory and applications. I will also discuss situations where researchers have avoided good learning opportunities. We can then try to use all these cases to develop some general understanding of how and when we learn from errors in the context of the fractal nature of scientific revolutions.

The video is here.

It’s sooooo frustrating when people get things wrong, the mistake is explained to them, and they still don’t make the correction or take the opportunity to learn from their mistakes.

To put it another way . . . when you find out you made a mistake, you learn three things:

1. Now: Your original statement was wrong.

2. Implications for the future: Beliefs and actions that flow from that original statement may be wrong. You should investigate your reasoning going forward and adjust to account for your error.

3. Implications for the past: Something in your existing workflow led to your error. You should trace your workflow, see how that happened, and alter your workflow accordingly.

In poker, they say to evaluate the strategy, not the play. In quality control, they say to evaluate the process, not the individual outcome. Similarly with workflow.

As we’ve discussed many many times in this space (for example, here), it makes me want to screeeeeeeeeeam when people forego this opportunity to learn. Why do people, sometimes very accomplished people, give up this opportunity? I’m speaking here of people who are trying their best, not hacks and self-promoters.

The simple answer for why even honest people will avoid admitting clear mistakes is that it’s embarrassing for them to admit error, they don’t want to lose face.

The longer answer, I’m afraid, is that at some level they recognize issues 1, 2, and 3 above, and they go to some effort to avoid confronting item 1 because they really really don’t want to face item 2 (their beliefs and actions might be affected, and they don’t want to hear that!) and item 3 (they might be going about everything all wrong, and they don’t want to hear that either!).

So, paradoxically, the very benefits of learning from error are scary enough to some people that they’ll deny or bury their own mistakes. Again, I’m speaking here of otherwise-sincere people, not of people who are willing to lie to protect their investment or make some political point or whatever.

In my talk, I’ll focus on my own mistakes, not those of others. My goal is for you in the audience to learn how to improve your own workflow so you can catch errors faster and learn more from them, in all three senses listed above.

P.S. Planning a talk can be good for my research workflow. I’ll get invited to speak somewhere, then I’ll write a title and abstract that seems like it should work for that audience, then the existence of this structure gives me a chance to think about what to say. For example, I’d never quite thought of the three ways of learning from error until writing this post, which in turn was motivated by the talk coming up. I like this framework. I’m not claiming it’s new—I guess it’s in Pólya somewhere—just that it will help my workflow. Here’s another recent example of how the act of preparing an abstract helped me think about a topic of continuing interest to me.

Regarding the use of “common sense” when evaluating research claims

I’ve often appealed to “common sense” or “face validity” when considering unusual research claims. For example, the statement that single women during certain times of the month were 20 percentage points more likely to support Barack Obama, or the claim that losing an election for governor increases politicians’ lifespan by 5-10 years on average, or the claim that a subliminal smiley face flashed on a computer screen causes large changes in people’s attitudes on immigration, or the claim that attractive parents are 36% more likely to have girl babies . . . these claims violated common sense. Or, to put it another way, they violated my general understanding of voting, health, political attitudes, and human reproduction.

I often appeal to common sense, but that doesn’t mean that I think common sense is always correct or that we should defer to common sense. Rather, common sense represents some approximation of a prior distribution or existing model of the world. When our inferences contradict our expectations, that is noteworthy (in a chapter 6 of BDA sort of way), and we want to address this. It could be that addressing this will result in a revision of “common sense.” That’s fine, but if we do decide that our common sense was mistaken, I think we should make that statement explicitly. What bothers me is when people report findings that contradict common sense and don’t address the revision in understanding that would be required to accept that.

In each of the above-cited examples (all discussed at various times on this blog), there was a much more convincing alternative explanation for the claimed results, given some mixture of statistical errors and selection bias (p-hacking or forking paths). That’s not to say the claims are wrong (Who knows?? All things are possible!), but it does tell us that we don’t need to abandon our prior understanding of these things. If we want to abandon our earlier common-sense views, that would be a choice to be made, an affirmative statement that those earlier views are held so weakly that they can be toppled by little if any statistical evidence.

P.S. Perhaps relevant is this recent article by Mark Whiting and Duncan Watts, “A framework for quantifying individual and collective common sense.”

Progress in 2023, Jessica Edition

Since Aki and Andrew are doing it… 

Published:

Unpublished/Preprints:

Performed:

If I had to choose a favorite (beyond the play, of course) it would be the rational agent benchmark paper, discussed here. But I also really like the causal quartets paper. The first aims to increase what we learn from experiments in empirical visualization and HCI through comparison to decision-theoretic benchmarks. The second aims to get people to think twice about what they’ve learned from an average treatment effect. Both have influenced what I’ve worked on since.

This post is not really about Aristotle.

I just read this magazine article by Nikhil Krishnan on the philosophy of Aristotle. As a former physics student, I’ve never had anything but disdain for that ancient philosopher, who’s famous for getting just about everything in physics wrong, as well as notoriously claiming that men had more teeth than women (or maybe it was the other way around). Dude just liked to make confident pronouncements. He also thought slavery was cool.

I guess that if Aristotle were around today, he’d be a long-form blogger.

That all said, Krishnan’s article was absolutely wonderful, and I can’t wait to read his forthcoming book. And I say this even though this article didn’t convince me one bit that Aristotle is worth reading! It did convince me that Krishnan is worth reading, which I guess is more relevant right now.

I’m not gonna go through and summarize the article here—it’s short, and you can follow the above link and read it online—instead, I’ll just point out some thoughts that it inspired in me. It’s the sign of a good work of literature that it makes us reflect.

1. Krishnan discusses AITA. One thing I’d never thought about before is the framing in terms of the asshole, the idea that in any dispute there is exactly one asshole. No room for honest misunderstandings or, from the other direction, a battle between two assholes. This kind of implies that the “asshole” trait is a property of the interaction, so that an unresolvable or unresolved conflict even between two wonderful caring people can inevitably lead to one of them becoming “the asshole.”

2. Krishnan writes:

[Aristotle] says, for instance, that people in politics who identify flourishing with honor can’t be right, for honor “seems to depend more on those who honor than on the one honored.” This has been dubbed the “Coriolanus paradox”: seekers of honor “tend to defeat themselves by making themselves dependent on those to whom they aim to be superior,” as Bernard Williams notes.

I’ve never heard of Bernard Williams, but I have heard of Coriolanus, and what really struck me about that bit is how much it reminded me of someone I know who absolutely loves honors—he lusts after them, I’d say, to the extent that this lust has warped his life. Nothing wrong with honors as long as they don’t do that distortion. But that point about honor-seekers defeating themselves by making themselves dependent on those to whom they aim to be superior . . . That hits the nail on the head for this guy, his frustration at not being recognized by his perceived inferiors. I’d never thought about that angle before. (Please don’t try to guess this person’s name in the comments; that’s not the point here.)

3. Krishnan writes that his college instructor gave this recommendation: “How to read Aristotle? Slowly.” This reminds me of the principle that God is in every leaf of every tree. In short: Sure, read Aristotle very slowly and you’ll learn a lot, in the same way that if you put any bug under a microscope and look at it carefully, or if you sit in front of a painting for a few hours and really force yourself to stare, or if you hold a long conversation with anyone, you’ll learn a lot.

4. Krishnan writes:

There is such a thing as the difference between right and wrong. But reliably telling them apart takes experience, the company of wise friends, and the good luck of having been well brought up.

Well put. It also helps not to be hungry or in pain.

What to trust in the newspaper? Example of “The Simple Nudge That Raised Median Donations by 80%”

Greg Mayer points to this news article, “The Simple Nudge That Raised Median Donations by 80%,” which states:

A start-up used the Hebrew word “chai” and its numerical match, 18, to bump up giving amounts. . . . It’s a common donation amount among Jews — $18, $180, $1,800 or even $36 and other multiples.

So Daffy lowered its minimum gift to $18 and then went further, prompting any donor giving to any Jewish charity to bump gifts up by some related amount. Within a year, median gifts had risen to $180 from $100. . . .

I see several warning signs here:

1. “Within a year, median gifts had risen to $180 from $100.” This is a before/after change, not a direct comparison of outcomes.

2. No report, just a quoted number which could easily have been made up. Yes, the numbers in a report can be fabricated too, but that takes more work and is more risk. Making up numbers when talking with a reporter, that’s easy.

3. The people who report the number are motivated to claim success; the reporter is motivated to report a success. The article is filled with promotion for this company. It’s a short article that mentions “Daffy” 6 times, for example in this bit, which reads like a straight-up ad:

If you have children, grandchildren, nieces or nephews, there’s another possibility. Daffy has a family plan that allows children to prompt their adult relatives to support a cause the children choose. Why not put the app on their iPhones or iPads so they can make suggestions and let, for example, a 12-year-old make $12 donations to 12 nonprofits each year?

Why not, indeed? Even better, why not have them make their donations directly to Daffy and cut out the middleman?? Look, I’m not saying that the people behind Daffy are doing anything wrong; it’s just that this is public relations, not journalism.

4. Use of the word “nudge” in the headline is consistent with business-press hype. Recall that “nudge” is a subfield whose proponents are well connected in the media and routinely make exaggerated claims.

So, yeah, an observational comparison with no documentation, in an article that’s more like an advertisement, that’s kinda sus. Not that the claim is definitely wrong, there’s just no good reason for us to take it seriously.

Here’s a sad post for you to start the new year. The Onion (ok, an Onion-affiliate site) is plagiarizing. For reals.

How horrible. I remember when The Onion started. They were so funny and on point. And now . . . What’s the point of even having The Onion if it’s running plagiarized material? I mean, yeah, sure, everybody’s gotta bring home money to put food on the table. But, really, what’s the goddam point of it all?

Jonathan Bailey has the story:

Back in June, G/O Media, the company that owns A.V. Club, Gizmodo, Quartz and The Onion, announced that they would be experimenting with AI tools as a way to supplement the work of human reporters and editors.

However, just a week later, it was clear that the move wasn’t going smoothly. . . . several months later, it doesn’t appear that things have improved. If anything, they might have gotten worse.

The reason is highlighted in a report by Frank Landymore and Jon Christian at Futurism. They compared the output of A.V. Club’s AI “reporter” against the source material, namely IMDB. What they found were examples of verbatim and near-verbatim copying of that material, without any indication that the text was copied. . . .

The articles in question have a note that reads as follows: “This article is based on data from IMDb. Text was compiled by an AI engine that was then reviewed and edited by the editorial staff.”

However, as noted by the Futurism report, that text does not indicate that any text is copied. Only that “data” is used. The text is supposed to be “compiled” by the AI and then “reviewed and edited” by humans. . . .

In both A.V. Club lists, there is no additional text or framing beyond the movies and the descriptions, which are all based on IMDb descriptions and, as seen in this case, sometimes copied directly or nearly directly from them.

There’s not much doubt that this is plagiarism. Though A.V. Club acknowledges that the “data” came from IMDb, it doesn’t indicate that the language does. There are no quotation marks, no blockquotes, nothing to indicate that portions are copied verbatim or near-verbatim. . . .

Bailey continues:

None of this is a secret. All of this is well known, well-understood and backed up with both hard data and mountains of anecdotal evidence. . . . But we’ve seen this before. Benny Johnson, for example, is an irredeemably unethical reporter with a history of plagiarism, fabrication and other ethical issues that resulted in him being fired from multiple publications.

Yet, he’s never been left wanting for a job. Publications know that, because of his name, he will draw clicks and engagement. . . . From a business perspective, AI is not very different from Benny Johnson. Though the flaws and integrity issues are well known, the allure of a free reporter who can generate countless articles at the push of a button is simply too great to ignore.

Then comes the economic argument:

But in there lies the problem, if you want AI to function like an actual reporter, it has to be edited, fact checked and plagiarism checked just like a real human.

However, when one does those checks, the errors quickly become apparent and fixing them often takes more time and resources than just starting with a human author.

In short, using an AI in a way that helps a company earn/save money means accepting that the factual errors and plagiarism are just part of the deal. It means completely forgoing journalism ethics, just like hiring a reporter like Benny Johnson.

Right now, for a publication, there is no ethical use of AI that is not either unprofitable or extremely limited. These “experiments” in AI are not about testing what the bots can do, but about seeing how much they can still lower their ethical and quality standards and still find an audience.

Ouch.

Very sad to see an Onion-affiliated site doing this.

Here’s how Bailey concludes:

The arc of history has been pulling publications toward larger quantities of lower quality content for some time. AI is just the latest escalation in that trend, and one that publishers are unlikely to ignore.

Even if it destroys their credibility.

No kidding. What next, mathematics professors who copy stories unacknowledged, introduce errors, and then deny they ever did it? Award-winning statistics professors who copy stuff from wikipedia, introducing stupid-ass errors in the process? University presidents? OK, none of those cases were shocking, they’re just sad. But to see The Onion involved . . . that truly is a step further into the abyss.

Judgments versus decisions

This is Jessica. A paper called “Decoupling Judgment and Decision Making: A Tale of Two Tails” by Oral, Dragicevic, Telea, and Dimara showed up in my feed the other day. The premise of the paper is that when people interact with some data visualization, their accuracy in making judgments might conflict with their accuracy in making decisions from the visualization. Given that the authors appear to be basing the premise in part on results from a prior paper on decision making from uncertainty visualizations I did with Alex Kale and Matt Kay, I took a look. Here’s the abstract:

Is it true that if citizens understand hurricane probabilities, they will make more rational decisions for evacuation? Finding answers to such questions is not straightforward in the literature because the terms “judgment” and “decision making” are often used interchangeably. This terminology conflation leads to a lack of clarity on whether people make suboptimal decisions because of inaccurate judgments of information conveyed in visualizations or because they use alternative yet currently unknown heuristics. To decouple judgment from decision making, we review relevant concepts from the literature and present two preregistered experiments (N=601) to investigate if the task (judgment vs. decision making), the scenario (sports vs. humanitarian), and the visualization (quantile dotplots, density plots, probability bars) affect accuracy. While experiment 1 was inconclusive, we found evidence for a difference in experiment 2. Contrary to our expectations and previous research, which found decisions less accurate than their direct-equivalent judgments, our results pointed in the opposite direction. Our findings further revealed that decisions were less vulnerable to status-quo bias, suggesting decision makers may disfavor responses associated with inaction. We also found that both scenario and visualization types can influence people’s judgments and decisions. Although effect sizes are not large and results should be interpreted carefully, we conclude that judgments cannot be safely used as proxy tasks for decision making, and discuss implications for visualization research and beyond. Materials and preregistrations are available at https://osf.io/ufzp5/?view_only=adc0f78a23804c31bf7fdd9385cb264f.

There’s a lot being said here, but they seem to be getting at a difference between forming accurate beliefs from some information and making a good (e.g., utility-optimal) decision. I would agree these are slightly different processes. But they are also claiming to have a way of directly comparing judgment accuracy to decision accuracy. While I appreciate the attempt to clarify terms that are often overloaded, I’m skeptical that we can meaningfully separate and compare judgments from decisions in an experiment.

Some background

Let’s start with what we found in our 2020 paper, since Oral et al. base some of their questions and their own study setup on it. In that experiment we’d had people make incentivized decisions from displays that varied only in how they visualized the decision-relevant probability distributions. Each one showed a distribution of expected scores in a fantasy sports game for a team with and without the addition of a new player. Participants had to decide whether to pay for the new player or not in light of the cost of adding the player, the expected score improvement, and the amount of additional monetary award they won when they scored above a certain number of points. We also elicited a (controversial) probability of superiority judgment: What do you think is the probability your team will score more points with the new player than without? In designing the experiment we held various aspects of the decision problem constant so that only the ground truth probability of superiority was varying between trials. So we talked about the probability judgment as corresponding to the decision task.
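To make the two responses concrete, here is a toy version of that setup (all numbers invented; these are not the parameters from the study, and I’m assuming independent normal score distributions just for illustration): the probability-of-superiority judgment and the pay-for-the-player decision are computed from the same pair of distributions but are different quantities.

    import numpy as np
    from scipy import stats

    # Invented parameters: team score distributions with and without the new
    # player, the score needed to win the award, the award, and the player's cost.
    mu_with, mu_without, sd = 105.0, 100.0, 15.0
    threshold, award, cost = 110.0, 4.0, 1.0

    # The elicited judgment: probability of superiority,
    # Pr(score with the new player > score without), assuming independent normals.
    prob_superiority = 1 - stats.norm.cdf(0, mu_with - mu_without, np.sqrt(2) * sd)

    # The incentivized decision: pay for the player iff the expected gain in
    # award money exceeds the player's cost.
    p_award_with = 1 - stats.norm.cdf(threshold, mu_with, sd)
    p_award_without = 1 - stats.norm.cdf(threshold, mu_without, sd)
    expected_gain = award * (p_award_with - p_award_without)

    print(f"probability of superiority: {prob_superiority:.2f}")
    print(f"expected gain {expected_gain:.2f} vs cost {cost:.2f} -> "
          f"{'pay' if expected_gain > cost else 'do not pay'}")

With these made-up numbers the new player is probably an improvement (probability of superiority around 0.59) and yet not worth paying for, which is one way to see why the probability-of-superiority judgment, while related, is not the belief that most directly feeds the decision.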

However, after modeling the results we found that depending on whether we analyzed results from the probability response question or the incentivized decision, the ranking of visualizations changed. At the time we didn’t have a good explanation for this disparity between what was helpful for doing the probability judgment versus the decision, other than maybe it was due to the probability judgment not being directly incentivized like the decision response was. But in a follow-up analysis that applied a rational agent analysis framework to this same study, allowing us to separate different sources of performance loss by calibrating the participants’ responses for the probability task, we saw that people were getting most of the decision-relevant information regardless of which question they were responding to; they just struggled to report it for the probability question. So we concluded that the most likely reason for the disparity between judgment and decision results was probably that the probability of superiority judgment was not the most intuitive judgment to be eliciting – if we really wanted to elicit the beliefs directly corresponding to the incentivized decision task, we should have asked them for the difference in the probability of scoring enough points to win the award with and without the new player. But this is still just speculation, since we still wouldn’t be able to say in such a setup how much the results were impacted by only one of the responses being incentivized. 

Oral et al. gloss over this nuance, interpreting our results as finding “decisions less accurate than their direct-equivalent judgments,” and then using this as motivation to argue that “the fact that the best visualization for judgment did not necessarily lead to better decisions reveals the need to decouple these two tasks.” 

Let’s consider for a moment by what means we could try to eliminate ambiguity in comparing probability judgments to the associated decisions. For instance, if only incentivizing one of the two responses confounds things, we might try incentivizing the probability judgment with its own payoff function, and compare the results to the incentivized decision results. Would this allow us to directly study the difference between judgments and decision-making? 

I argue no. For one, we would need to use different scoring rules for the two different types of response, and things might rank differently depending on the rule (not to mention one rule might be easier to optimize under). But on top of this, I would argue that once you provide a scoring rule for the judgment question, it becomes hard to distinguish that response from a decision by any reasonable definition. In other words, you can’t eliminate confounds that could explain a difference between “judgment” and “decision” without turning the judgment into something indistinguishable from a decision. 

What is a decision? 

The paper by Oral et al. describes abundant confusion in the literature about the difference between judgment and decision-making, proposing that “One barrier to studying decision making effectively is that judgments and decisions are terms not well-defined and separated.” They criticize various studies on visualizations for claiming to study decisions when they actually study judgments. Ultimately they describe their view as:

In summary, while decision making shares similarities with judgment, it embodies four distinguishing features: (I) it requires a choice among alternatives, implying a loss of the remaining alternatives, (II) it is future-oriented, (III) it is accompanied with overt or covert actions, and (IV) it carries a personal stake and responsibility for outcomes. The more of these features a judgment has, the more “decision-like” it becomes. When a judgment has all four features, it no longer remains a judgment and becomes a decision. This operationalization offers a fuzzy demarcation between judgment and decision making, in the sense that it does not draw a sharp line between the two concepts, but instead specifies the attributes essential to determine the extent to which a cognitive process is a judgment, a decision, or somewhere in-between [58], [59].

This captures components of other definitions of decision I’ve seen in research related to evaluating interfaces, e.g., a decision as “a choice between alternatives,” typically involving “high stakes.” However, like these other definitions, I don’t think Oral et al.’s definition very clearly differentiates a decision from other forms of judgment.

Take the “personal stake and responsibility for outcomes” part. How do we interpret this given that we are talking about subjects in an experiment, not decisions people are making in some more naturalistic context?    

In the context of an experiment, we control the stakes and one’s responsibility for their action via a scoring rule. We could instead ask people to imagine making some life-or-death decision in our study and call it high stakes, as many researchers do. But they are in an experiment, and they know it. In the real world people have goals, but in an experiment you have to endow them with goals.

So we should incentivize the question to ensure participants have some sense of the consequences associated with what they decide. We can ask them to separately report their beliefs, e.g., what they perceive some decision-relevant probability to be, as we did in the 2020 study. But if we want to eliminate confounds between the decision and the judgment, we should incentivize the belief question too, ideally with a proper scoring rule so that it’s in their best interest to tell the truth. Now both our decision task and our judgment task, from the standpoint of the experiment subject, would seem to have some personal stake. So we can’t distinguish the decision easily based on its personal stakes.
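Here is a tiny numerical check of that point (a generic illustration, not anything from Oral et al.): under a proper scoring rule such as the quadratic/Brier score, the report that maximizes your expected score is your true belief, so a scored belief report is itself a choice made to maximize an expected payoff.

    import numpy as np

    def expected_quadratic_score(report, belief):
        """Expected quadratic (Brier-style) score of reporting `report` when
        your true belief that the event occurs is `belief`.
        Score = 1 - (outcome - report)^2, so higher is better."""
        return belief * (1 - (1 - report) ** 2) + (1 - belief) * (1 - report ** 2)

    belief = 0.7
    reports = np.linspace(0, 1, 101)
    scores = [expected_quadratic_score(r, belief) for r in reports]
    print(f"best report: {reports[np.argmax(scores)]:.2f}")   # 0.70, the true belief

Once the belief question is scored this way, the participant is again choosing a response to maximize an expected payoff, which is the sense in which the elicited judgment stops being cleanly distinguishable from a decision.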

Oral et al. might argue that the judgment question is still not a decision, because there are three other criteria for a decision according to their definition. Considering (I), will asking for a person’s belief require them to make a choice between alternatives? Yes, it will, because any format we elicit their response in will naturally constrain it. Even if we just provide a text box to type in a number between 0 and 1, we’re going to get values rounded at some decimal place. So it’s hard to use “a choice among alternatives” as a distinguishing criterion in any actual experiment.

What about (II), being future-oriented? Well, if I’m incentivizing the question then it will be just as future-oriented as my decision is, in that my payoff depends on my response and the ground truth, which is unknown to me until after I respond.

When it comes to (III), overt or covert actions, as in (I), in any actual experiment, my action space will be some form of constrained response space. It’s just that now my action is my choice of which beliefs to report. The action space might be larger, but again there is no qualitative difference between choosing what beliefs to report and choosing what action to report in some more constrained decision problem.

To summarize, by trying to put judgments and decisions on equal footing by scoring both, I’ve created something that seems to meet Oral et al.’s definition of a decision. While I do think there is a difference between a belief and a decision, I don’t think it’s so easy to measure these things without leaving open various other explanations for why the responses differ.

In their paper, Oral et al. sidestep incentivizing participants directly, assuming they will be intrinsically motivated. They report on two experiments where they used a task inspired by our 2020 paper (showing visualizations of expected score distributions and asking, Do you want the team with or without the new player, where the participant’s goal is to win a monetary award that requires scoring a certain number of points). Instead of incentivizing the decision with the scoring rule, they simply told participants to try to be accurate. And instead of eliciting the corresponding probabilistic beliefs for the decision, they asked them two questions: Which option (team) is better?, and Which of the teams do you choose? They interpret the first answer as the judgment and the second as the decision.

I can sort of see what they are trying to do here, but this seems like essentially the same task to me. Especially if you assume people are intrinsically motivated to be accurate and plan to evaluate responses using the same scoring rule, as they do. Why would we expect a difference between these two responses? To use a different example that came up in a discussion I was having with Jason Hartline, if you imagine a judge who cares only about doing the right thing (convicting the guilty and acquitting the innocent), who must decide whether to acquit or convict a defendant, why would you expect a difference (in accuracy) when you ask them ‘Is he guilty’ versus ‘Will you acquit or convict?’ 

In their first experiment using this simple wording, Oral et al. find no difference between responses to the two questions. In a second experiment they slightly changed the wording of the questions to emphasize that one was “your judgment” and one was “your decision.” This leads to what they say is suggestive evidence that people’s decisions are more accurate than their judgments. I’m not so sure.

The takeaway

It’s natural to conceive of judgments or beliefs as being distinct from decisions. If we subscribe to a Bayesian formulation of learning from data, we expect the rational person to form beliefs about the state of the world and then choose the utility maximizing action given those beliefs. However, it is not so natural to try to directly compare judgments and decisions on equal footing in an experiment. 

More generally, when it comes to evaluating human decision-making (what we generally want to do in research related to interfaces) we gain little by preferring colloquial verbal definitions over the formalisms of statistical decision theory, which provide tools designed to evaluate people’s decisions ex-ante. It’s much easier to talk about judgment and decision-making when we have a formal way of representing a decision problem (i.e., state space, action space, data-generating model, scoring rule), and a shared understanding of what the normative process of learning from data to make a decision is (i.e., start with prior beliefs, update them given some signal, choose the action that maximizes your expected score under the data-generating model). In this case, we could get some insight into how judgments and decisions can differ simply by considering the process implied by expected utility theory. 

John Mandrola’s tips for assessing medical evidence

Ben Recht writes:

I’m a fan of physician-blogger John Mandrola. He had a nice response to your blog, using it as a jumping-off point for a short tutorial on his rather conservative approach to medical evidence assessment.

John is always even-tempered and constructive, and I thought you might enjoy this piece as an “extended blog comment.” I think he does a decent job answering the question at hand, and his approach to medical evidence appraisal is one I more or less endorse.

My post in question was called, How to digest research claims? (1) vitamin D and covid; (2) fish oil and cancer, and I concluded with this bit of helplessness: “I have no idea what to think about any of these papers. The medical literature is so huge that it often seems hopeless to interpret any single article or even subliterature. I don’t know what is currently considered the best way to summarize the state of medical knowledge on any given topic.”

In his response, “Simple Rules to Understand Medical Claims,” Mandrola offers some tips:

The most important priors when it comes to medical claims are simple: most things don’t work. Most simple answers are wrong. Humans are complex. Diseases are complex. Single causes of complex diseases like cancer should be approached with great skepticism.

One of the studies sent to Gelman was a small trial finding that Vitamin D effectively treated COVID-19. The single-center open-label study enrolled 76 patients in early 2020. Even if this were the only study available, the evidence is not strong enough to move our prior beliefs that most simple things (like a Vitamin D tablet) do not work.

The next step is a simple search—which reveals two large randomized controlled trials of Vitamin D treatment for COVID-19, one published in JAMA and the other in the BMJ. Both were null.

You can use the same strategy for evaluating the claim that fish oil supplementation leads to higher rates of prostate cancer.

Start with prior beliefs. How is it possible that one exposure increases the rate of a disease that mostly affects older men? Answer: it’s not very possible. . . .

Now consider the claims linked in Gelman’s email.

– Serum Phospholipid Fatty Acids and Prostate Cancer Risk: Results From the Prostate Cancer Prevention Trial

– Plasma Phospholipid Fatty Acids and Prostate Cancer Risk in the SELECT Trial

While both studies stemmed from randomized trials, neither was a primary analysis. These were association studies using data from the main trial, and therefore we should be cautious in making causal claims.

Now go to Google. This reveals two large randomized controlled trials of fish oil vs placebo therapy.

– The ASCEND trial of n-3 fatty acids in 15k patients with diabetes found “no significant between-group differences in the incidence of fatal or nonfatal cancer either overall or at any particular body site.” And I would add no difference in all-cause death.

– The VITAL trial included cancer as a primary endpoint. More than 25k patients were randomized. The conclusions: “Supplementation with n−3 fatty acids did not result in a lower incidence of major cardiovascular events or cancer than placebo.”

Mandrola concludes:

I am not arguing that every claim is simple. My case is that the evaluation process is slightly less daunting than Professor Gelman seems to infer.

Of course, medical science can be complicated. Content expertise can be important. . . .

But that does not mean we should take the attitude: “I have no idea what to think about these papers.”

I offer five basic rules of thumb that help in understanding medical claims:

1. Hold pessimistic priors

2. Be super-cautious about causal inferences from nonrandom observational comparisons

3. Look for big randomized controlled trials—and focus on their primary analyses

4. Know that stuff that really works is usually obvious (antibiotics for bacterial infection; AEDs to convert VF)

5. Respect uncertainty. Stay humble about most “positive” claims.

This all makes sense, as long as we recognize that randomized controlled trials are themselves nonrandom observational comparisons: the people in the study won’t in general be representative of the population of interest, and there are also issues such as dropout, selection bias, and lack of realism of treatments, which can be huge in medical trials. Experimentation is great; we just need to avoid the pitfalls of (a) idealizing studies that have randomization (the “chain is as strong as its strongest link” fallacy) and (b) disparaging observational data without assessing its quality.
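Circling back to rule 1, here is one way to put “hold pessimistic priors” into numbers. This is a back-of-the-envelope sketch with invented inputs; the prior and the likelihood ratios below are mine, not Mandrola’s:

```python
# Back-of-the-envelope version of rule 1 ("hold pessimistic priors").
# The prior and the likelihood ratios below are invented for illustration;
# they are not Mandrola's numbers.

def posterior(prior, likelihood_ratio):
    """Bayes update in odds form: posterior odds = prior odds * likelihood ratio."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

prior = 0.05           # "most things don't work": say 1 in 20 such claims pans out
small_trial = 3.0      # a small, open-label, single-center trial: weak evidence at best
large_rct = 30.0       # roughly what a large, well-conducted RCT might supply

print(round(posterior(prior, small_trial), 2))  # ~0.14: still probably doesn't work
print(round(posterior(prior, large_rct), 2))    # ~0.61: now worth taking seriously
```

The point is just that weak evidence applied to a skeptical prior leaves the claim unlikely, which is the “not strong enough to move our prior beliefs” line above in quantitative form.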

For our discussion here, the most relevant bit of Mandrola’s advice was this from the comment thread:

Why are people going to a Political Scientist for medical advice? That is odd.

I hope Prof Gelman’s answer was based on a recognition that he doesn’t have the context and/or the historical background to properly interpret the studies.

The answer is: Yes, I do recognize my ignorance! Here’s what I wrote in the above-linked post:

I’m not saying that the answers to these medical questions are unknowable, or even that nobody knows the answers. I can well believe there are some people who have a clear sense of what’s going on here. I’m just saying that I have no idea what to think about these papers.

Mandrola’s advice given above seems reasonable to me. But it can be hard for me to apply, in that he’s assuming background medical knowledge that I don’t have. On the other hand, when it comes to social science, I know a lot. For example, when I saw the claim that women during a certain time of the month were 20 percentage points more likely to vote for Barack Obama, it was immediately clear this was ridiculous, because public opinion just doesn’t change that much. This had nothing to do with randomized trials or observational comparisons or anything like that; the study was just too noisy to learn anything from.
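To give a rough sense of what “too noisy” means, here is a quick standard-error calculation with made-up sample sizes (the study’s actual numbers aren’t in this post): with something like a hundred respondents per group, the standard error on a difference in proportions is about seven percentage points, so double-digit swings in a point estimate are well within what noise alone would produce.

```python
# Rough noise check with made-up sample sizes (the actual study's numbers
# aren't given in this post). With ~100 respondents per group, the standard
# error on a difference in proportions is several percentage points, so
# double-digit swings in the point estimate are consistent with noise.
import math

def se_diff_in_proportions(p1, n1, p2, n2):
    """Standard error of the difference between two independent sample proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

se = se_diff_in_proportions(0.5, 100, 0.5, 100)
print(round(se, 3))         # 0.071 -> about 7 percentage points
print(round(1.96 * se, 3))  # 0.139 -> a 95% interval of roughly +/- 14 points
```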