Preregistration is a floor, not a ceiling.

Posted on March 17, 2024 4:37 PM by Andrew

This comes up from time to time, for example someone sent me an email expressing a concern that preregistration stifles innovation: if Fleming had preregistered his study, he never would’ve noticed the penicillin mold, etc.

My response is that preregistration is a floor, not a ceiling. Preregistration is a list of things you plan to do, that’s all. Preregistration does not stop you from doing more. If Fleming had followed a pre-analysis protocol, that would’ve been fine: there would have been nothing stopping him from continuing to look at his bacterial cultures.

As I wrote in comments to my 2022 post, “What’s the difference between Derek Jeter and preregistration?” (which I just added to the lexicon), you don’t preregister “the” exact model specification; you preregister “an” exact model specification, and you’re always free to fit other models once you’ve seen the data.

It can be really valuable to preregister, to formulate hypotheses and simulate fake data before gathering any real data. To do this requires assumptions—it takes work!—and I think it’s work that’s well spent. And then, when the data arrive, do everything you’d planned to do, along with whatever else you want to do.

Planning ahead should not get in the way of creativity. It should enhance creativity because you can focus your data-analytic efforts on new ideas rather than having to first figure out what defensible default thing you’re supposed to do.

Aaaand, pixels are free, so here’s that 2002 post in full:
Continue reading →

“On the uses and abuses of regression models: a call for reform of statistical practice and teaching”: We’d appreciate your comments . . .

Posted on March 17, 2024 9:51 AM by Andrew

John Carlin writes:

I wanted to draw your attention to a paper that I’ve just published as a preprint: On the uses and abuses of regression models: a call for reform of statistical practice and teaching (pending publication I hope in a biostat journal). You and I have discussed how to teach regression on a few occasions over the years, but I think with the help of my brilliant colleague Margarita Moreno-Betancur I have finally figured out where the main problems lie – and why a radical rethink is needed. Here is the abstract:

When students and users of statistical methods first learn about regression analysis there is an emphasis on the technical details of models and estimation methods that invariably runs ahead of the purposes for which these models might be used. More broadly, statistics is widely understood to provide a body of techniques for “modelling data”, underpinned by what we describe as the “true model myth”, according to which the task of the statistician/data analyst is to build a model that closely approximates the true data generating process. By way of our own historical examples and a brief review of mainstream clinical research journals, we describe how this perspective leads to a range of problems in the application of regression methods, including misguided “adjustment” for covariates, misinterpretation of regression coefficients and the widespread fitting of regression models without a clear purpose. We then outline an alternative approach to the teaching and application of regression methods, which begins by focussing on clear definition of the substantive research question within one of three distinct types: descriptive, predictive, or causal. The simple univariable regression model may be introduced as a tool for description, while the development and application of multivariable regression models should proceed differently according to the type of question. Regression methods will no doubt remain central to statistical practice as they provide a powerful tool for representing variation in a response or outcome variable as a function of “input” variables, but their conceptualisation and usage should follow from the purpose at hand.

The paper is aimed at the biostat community, but I think the same issues apply very broadly at least across the non-physical sciences.

Interesting. I think this advice is roughly consistent with what Aki, Jennifer, and I say and do in our books Regression and Other Stories and Active Statistics.

More specifically, my take on teaching regression is similar to what Carlin and Moreno say, with the main difference being that I find that students have a lot of difficulty understanding plain old mathematical models. I spend a lot of time teaching the meaning of y = a + bx, how to graph it, etc. I feel that most regression textbooks focus too much on the error term and not enough on the deterministic part of the model. Also, I like what we say on the first page of Regression and Other Stories, about the three tasks of statistics being generalizing from sample to population, generalizing from control to treatment group, and generalizing from observed data to underlying constructs of interest. I think models are necessary for all three of these steps, so I do think that understanding models is important, and I’m not happy with minimalist treatments of regression that describe it as a way of estimating conditional expectations.

The first of these tasks is sampling inference, the second is causal inference, and the third refers to measurement. Statistics books (including my own) spend lots of time on sampling and causal inference, not so much on measurement. But measurement is important! For an example, see here.

If any of you have reactions to Carlin and Moreno’s paper, or if you have reactions to my reactions, please share them in comments, as I’m sure they’d appreciate it.

How often is there a political candidate such as Vivek Ramaswamy who is so much stronger in online polls than telephone polls?

Posted on March 16, 2024 9:30 AM by Andrew

Palko points to this news article, “The mystery of Vivek Ramaswamy’s rapid rise in the polls,” which states:

Ramaswamy’s strength comes almost entirely from polls conducted over the internet, according to a POLITICO analysis. In internet surveys over the past month — the vast majority of which are conducted among panels of people who sign up ahead of time to complete polls, often for financial incentives — Ramaswamy earns an average of 7.8 percent, a clear third behind Trump and DeSantis.

In polls conducted mostly or partially over the telephone, in which people are contacted randomly, not only does Ramaswamy lag his average score — he’s way back in seventh place, at just 2.6 percent.

There’s no singular, obvious explanation for the disparity, but there are some leading theories for it, namely the demographic characteristics and internet literacy of Ramaswamy’s supporters, along with the complications of an overly white audience trying to pronounce the name of a son of immigrants from India over the phone.”

And then, in order for a respondent to choose Ramaswamy in a phone poll, he or she will have to repeat the name back to the interviewer. And the national Republican electorate is definitely older and whiter than the country as a whole: In a recent New York Times/Siena College poll, more than 80 percent of likely GOP primary voters were white, and 38 percent were 65 or older.

‘When your candidate is named Vivek Ramaswamy,’ said one Republican pollster, granted anonymity to discuss the polling dynamics candidate, ‘that’s like DEFCON 1 for confusion and mispronunciation.’

Palko writes:

Keeping in mind that the “surge” was never big (maxed out at 10% and has been flat since), we’re talking about fairly small numbers in absolute terms, here are some questions:

1. How much do we normally expect phone and online to agree?

2. Ramaswamy generally scores around 3 times higher online than with phone. Have we seen that magnitude before?

3. How about a difficult name bias. Have we seen that before? How about Buttigieg, for instance? Did a foreign-sounding name hurt Obama in early polls?

4. Is the difference in demographics great enough to explain the difference? Aren’t things like gender and age normally reweighted?

5. Are there other explanations we should consider?

I don’t have any answers here, just one thought which is that it’s early in the campaign (I guess I should call it the pre-campaign, given that the primary elections haven’t started yet), and so perhaps journalists are reasoning that, even if this candidate is not very popular among voters, his active internet presence makes him a reasonable dark-horse candidate looking forward. An elite taste now but could perhaps spread to the non-political-junkies in the future? Paradoxically, the fact that Ramaswamy has this strong online support despite his extreme political stances could be taken as a potential sign of strength? I don’t know.

Conformal prediction and people

Posted on March 15, 2024 12:57 PM by Jessica Hullman

This is Jessica. A couple weeks I wrote a post in response to Ben Recht’s critique of conformal prediction for quantifying uncertainty in a prediction. Compared to Ben, I am more open-minded about conformal prediction and associated generalizations like conformal risk control. Quantified uncertainty is inherently incomplete as an expression of the true limits of our knowledge, but I still often find value in trying to quantify it over stopping at a point estimate.

If expressions of uncertainty are generally wrong in some ways but still sometimes useful, then we should be interested in how people interact with different approaches to quantifying uncertainty. So I’m interested in seeing how people use conformal prediction sets relative to alternatives. This isn’t to say that I think conformal approaches can’t be useful without being human-facing (which is the direction of some recent work on conformal decision theory). I just don’t think I would have spent the last ten years thinking about how people interact and make decisions with data and models if I didn’t believe that they need to be involved in many decision processes.

So now I want to discuss what we know from the handful of controlled studies that have looked at human use of prediction sets, starting with the one I’m most familiar with since it’s from my lab.

In Evaluating the Utility of Conformal Prediction Sets for AI-Advised Image Labeling, we study people making decisions with the assistance of a predictive model. Specifically, they label images with access to predictions from a pre-trained computer vision model. In keeping with the theme that real world conditions may deviate from expectations, we consider two scenarios: one where the model makes highly accurate predictions because the new images are from the same distribution as those that the model is trained on, and one where the new images are out of distribution.

We compared their accuracy and the distance between their responses and the true label (in the Wordnet hierarchy, which conveniently maps to ImageNet) across four display conditions. One was no assistance at all, so we could benchmark unaided human accuracy against model accuracy for our setting. People were generally worse than the model in this setting, though the human with AI assistance was able to do better than the model alone in a few cases.

The other three displays were variations on model assistance, including the model’s top prediction with the softmax probability, the top 10 model predictions with softmax probabilities, and a prediction set generated using split conformal prediction with 95% coverage.

We calibrated the prediction sets we presented offline, not dynamically. Because the human is making decisions conditional on the model predictions, we should expect the distribution to change. But often we aren’t going to be able to calibrate adaptively because we don’t immediately observe the ground truth. And even if we do, at any particular point in time we could still be said to hover on the boundary of having useful prior information and steering things off course. So when we introduce a new uncertainty quantification to any human decision setting, we should be concerned with how it works when the setting is as expected and when it’s not, i.e., the guarantees may be misleading.

Our study partially gets at this. Ideally we would have tested some cases where the stated coverage guarantee for the prediction sets was false. But for the out-of-distribution images we generated, we would have had to do a lot of cherry-picking of stimuli to break the conformal coverage guarantee as much as the top-1 coverage broke. The coverage degraded a little but stayed pretty high over the entire set of out-of-distribution instances for the types of perturbations we focused on (>80%, compared to 70% for top 1- and 43% for top 1). For the set of stimuli we actually tested, the coverage for all three was a bit higher, with top 1 coverage getting the biggest bump (70% compared to 83% top 10, 95% conformal). Below are some examples of the images people were classifying (where easy and hard is based on the cross-entropy loss given the model’s predicted probabilities, and smaller and larger refers to the size of the prediction sets).

We find that prediction sets don’t offer much value over top-1 or top-10 displays when the test instances are iid, and they can reduce accuracy on average for some types of instances. However, when the test instances are out of distribution, accuracy is slightly higher with access to prediction sets than with either top-k. This was the case even though the prediction sets for the OOD instances get very large (the average set size for “easy” OOD instances, as defined by the distribution of softmax values, was ~17, for “hard” OOD instances it was ~61, with people sometimes seeing sets with over 100 items). For the in-distribution cases, average set size was about 11 for the easy instances, and 30 for the hard ones.

Based on the differences in coverage across the conditions we studied, our results are more likely to be informative for situations where conformal prediction is used because we think it’s going to degrade more gracefully under unexpected shifts. I’m not sure it’s reasonable to assume we’d have a good hunch about that in practice though.

In designing this experiment in discussion with my co-authors, and thinking more about the value of conformal prediction to model-assisted human decisions, I’ve been thinking about when a “bad” (in the sense of coming with a misleading guarantee) interval might still be better than no uncertainty quantification. I was recently reading Paul Meehl’s clinical vs statistical prediction, where he contrasts clinical judgments doctors make based on intuitive reasoning to statistical judgments informed by randomized controlled experiments. He references a distinction between the “context of justification” for some internal sense of probability that leads to a decision like a diagnosis, and the “context of verification” where we collect the data we need to verify the quality of a prediction.

The clinician may be led, as in the present instance, to a guess which turns out to be correct because his brain is capable of that special “noticing the unusual” and “isolating the pattern” which is at present not characteristic of the traditional statistical techniques. Once he has been so led to a formulable sort of guess, we can check up on him actuarially.

Thinking about the ways prediction intervals can affect decisions makes me think that whenever we’re dealing with humans, there’s potentially going to be a difference between what an uncertainty expression says and can guarantee and the value of that expression for the decision-maker. Quantifications with bad guarantees can still be useful if they change the context of discovery in ways that promote broader thinking or taking the idea of uncertainty seriously. This is what I meant when in my last post I said “the meaning of an uncertainty quantification depends on its use.” But precisely articulating how they do this is hard. It’s much easier to identify ways calibration can break.

There a few other studies that look at human use of conformal prediction sets, but to avoid making this post even longer, I’ll summarize them in an upcoming post.

P.S. There have been a few other interesting posts on uncertainty quantification in the CS blogosphere recently, including David Stutz’s response to Ben’s remarks about conformal prediction, and on designing uncertainty quantification for decision making from Aaron Roth.

“Hot hand”: The controversy that shouldn’t be. And thinking more about what makes something into a controversy:

Posted on March 15, 2024 9:31 AM by Andrew

I was involved in a recent email discussion, leading to this summary:

There is no theoretical or empirical reason for the hot hand to be controversial. The only good reason for there being a controversy is that the mistaken paper by Gilovich et al. appeared first. At this point we should give Gilovich et al. credit for bringing up the hot hand as a subject of study and accept that they were wrong in their theory, empirics, and conclusions, and we can all move on. There is no shame in this for Gilovich et al. We all make mistakes, and what’s important is not the personalities but the research that leads to understanding, often through tortuous routes.

“No theoretical reason”: see discussion here, for example.

“No empirical reason”: see here and lots more in the recent literature.

“The only good reason . . . appeared first”: Beware the research incumbency rule.

More generally, what makes something a controversy? I’m not quite sure, but I think the news media play a big part. We talked about this recently in the context of the always-popular UFOs-as-space-aliens theory, which used to be considered a joke in polite company but now seems to have reached the level of controversy.

I don’t have anything systematic to say about all this right now, but the general topic seems very worthy of study.

“Here’s the Unsealed Report Showing How Harvard Concluded That a Dishonesty Expert Committed Misconduct”

Posted on March 14, 2024 7:34 PM by Andrew

Stephanie Lee has the story:

Harvard Business School’s investigative report into the behavioral scientist Francesca Gino was made public this week, revealing extensive details about how the institution came to conclude that the professor committed research misconduct in a series of papers.

The nearly 1,300-page document was unsealed after a Tuesday ruling from a Massachusetts judge, the latest development in a $25 million lawsuit that Gino filed last year against Harvard University, the dean of the Harvard Business School, and three business-school professors who first notified Harvard of red flags in four of her papers. All four have been retracted. . . .

According to the report, dated March 7, 2023, one of Gino’s main defenses to the committee was that the perpetrator could have been someone else — someone who had access to her computer, online data-storage account, and/or data files.

Gino named a professor as the most likely suspect. The person’s name was redacted in the released report, but she is identified as a female professor who was a co-author of Gino’s on a 2012 now-retracted paper about inducing honest behavior by prompting people to sign a form at the top rather than at the bottom. . . . Allegedly, she was “angry” at Gino for “not sufficiently defending” one of their collaborators “against perceived attacks by another co-author” concerning an experiment in the paper.

But the investigation committee did not see a “plausible motive” for the other professor to have committed misconduct by falsifying Gino’s data. “Gino presented no evidence of any data falsification actions by actors with malicious intentions,” the committee wrote. . . .

Gino’s other main defense, according to the report: Honest errors may have occurred when her research assistants were coding, checking, or cleaning the data. . . .

Again, however, the committee wrote that “she does not provide any evidence of [research assistant] error that we find persuasive in explaining the major anomalies and discrepancies.”

The full report is at the link.

Some background is here, also here, and some reanalyses of the data are linked here.

Now we just have to get to the bottom of the story about the shredder and the 80-pound rock and we’ll pretty much have settled all the open questions in this field.

We’ve already determined that the “burly coolie” story and the “smallish town” story never happened.

It’s good we have dishonesty experts. There’s a lot of dishonesty out there.

Abraham Lincoln and confidence intervals

Posted on March 14, 2024 9:04 AM by Andrew

This one from 2017 is good; I want to share it with all of you again:

Our recent discussion with mathematician Russ Lyons on confidence intervals reminded me of a famous logic paradox, in which equality is not as simple as it seems.

The classic example goes as follows: Abraham Lincoln is the 16th president of the United States, but this does not mean that one can substitute the two expressions “Abraham Lincoln” and “the 16th president of the United States” at will. For example, consider the statement, “If things had gone a bit differently in 1860, Stephen Douglas could have become the 16th president of the United States.” This becomes flat-out false if we do the substitution: “If things had gone a bit differently in 1860, Stephen Douglas could have become Abraham Lincoln.”

Now to confidence intervals. I agree with Rink Hoekstra, Richard Morey, Jeff Rouder, and Eric-Jan Wagenmakers that the following sort of statement, “We can be 95% confident that the true mean lies between 0.1 and 0.4,” is not in general a correct way to describe a classical confidence interval. Classical confidence intervals represent statements that are correct under repeated sampling based on some model; thus the correct statement (as we see it) is something like, “Under repeated sampling, the true mean will be inside the confidence interval 95% of the time” or even “Averaging over repeated samples, we can be 95% confident that the true mean lies between the lower and upper endpoints of the confidence interval.” Russ Lyons, however, felt the statement “We can be 95% confident that the true mean lies between 0.1 and 0.4,” was just fine. In his view, “this is the very meaning of “confidence.'”

This is where Abraham Lincoln comes in. We can all agree on the following summary:

A. Averaging over repeated samples, we can be 95% confident that the true mean lies between the lower and upper endpoints of the confidence interval.

And we could even perhaps feel that the phrase “confidence interval” implies “averaging over repeated samples,” and thus the following statement is reasonable:

B. “We can be 95% confident that the true mean lies between the lower and upper endpoints of the confidence interval.”

Now consider the other statement that caused so much trouble:

C. “We can be 95% confident that the true mean lies between 0.1 and 0.4.”

In a problem where the confidence interval is [0.1, 0.4], “the lower and upper endpoints of the confidence interval” is just “0.1 and 0.4.” So B and C are the same, no? No. Abraham Lincoln, meet the 16th president of the United States.

In statistical terms, once you supply numbers on the interval, you’re conditioning on it. You’re no longer implicitly averaging over repeated samples. Just as, once you supply a name to the president, you’re no longer implicitly averaging over possible elections.

So here’s what happened. We can all agree on statement A. Statement B is a briefer version of A, eliminating the explicit mention of replications because they are implicit in the reference to a confidence interval. Statement C does a seemingly innocuous switch but, as a result, implies conditioning on the interval, thus resulting in a much stronger statement that is not necessarily true (that is, in mathematical terms, is not in general true).

None of this is an argument over statistical practice. One might feel that classical confidence statements are a worthy goal for statistical procedures, or maybe not. But, like it or not, confidence statements are all about repeated sampling and are not in general true about any particular interval that you might see.

P.S. More here.

You probably don’t have a general algorithm for an MLE of Gaussian mixtures

Posted on March 13, 2024 3:00 PM by Bob Carpenter

Those of you who are familiar with Garey and Johnson’s 1979 classic, Computers and Intractability: a guide to the theory of NP-completeness, may notice I’m simply “porting” their introduction, including the dialogue, to the statistics world.

Imagine Andrew had tasked me and Matt Hoffman with fitting simple standard (aka isotropic, aka spherical) Gaussian mixtures rather than hierarchical models. Let’s say that Andrew didn’t like that K-means got a different answer every time he ran it, K-means++ wasn’t much better, and even using soft-clustering (i.e., fitting the stat model with EM) didn’t let him replicate simulated data. Would we have something like Stan for mixtures. Sadly, no. Matt and I may have tried and failed. We wouldn’t want to go back to Andrew and say,

“We can’t find an efficient algorithm. I guess we’re just too dumb.”

We’re computer scientists and we know about proving hardness. We’d like to say,

“We can’t find an efficient algorithm, because no such algorithm is possible.”

But that would’ve been beyond Matt’s and my grasp, because, in this particular case, it would require solving the biggest open problem in theoretical computer science. Instead, it’s almost certain we would have come back and said,

“We can’t find an efficient algorithm, but neither can all these famous people.”

That seems weak. Why would we say that? Because we could’ve proven that the problem is NP-hard. A problem is in the class P if it can be solved in polynomial time with a deterministic algorithm. A problem is in the class NP when there is a non-deterministic (i.e., infinitely parallel) algorithm to solve it in polynomial time. It’s NP-hard if it’s just as hard as any other NP algorithm (formally specified through reductions, a powerful CS proof technique that’s the basis of Gödel’s incompleteness theorem). An NP-hard algorithm often has a non-deterministic algorithm to solve it makes a complete set of (exponentially many) guesses in parallel and then spends polynomial time on each one verifying whether or not it is a solution. An algorithm is NP-complete if it is NP-hard and a member of NP. Some well known NP-complete problems are bin packing, satisfiability in propositional logic, and the traveling salesman problem—there’s a big list of NP-complete problems.

Nobody has found a tractable algorithm to solve an NP-hard problem. When we (computer scientists) say “tractable,” we mean solvable in polynomial time with a deterministic algorithm (i.e., the problem is in P). The only known algorithms for NP-hard problems are exponential. Researchers have been working for the last 50+ years trying to prove that the class of NP problems is disjoint from the class of P problems.

In other words, there’s a Turing Award waiting for you if you can actually turn response (3) into response (2).

In the case of mixtures of standard (spherical, isotropic) Gaussians there’s a short JMLR paper with a proof that maximum likelihood estimation is NP-hard.

Tosh and Dasgupta. 2016. Maximum Likelihood Estimation for Mixtures of Spherical Gaussians is NP-hard. JMLR

And yes, that’s the same Tosh as who was the first author of the “piranha” paper.

Ising models that are not restricted to be planar are also NP-hard.

Lucas. 2014. Ising formulations of many NP problems. Frontiers in Physics.

What both these problems have in common is that they are combinatorial and require inference over sets. I think (though am really not sure) that one of the appeals of quantum computing is potentially solving NP-hard problems.

P.S. How this story really would’ve went is that we would’ve told Andrew that some simple distributions over NP-hard problem instances lead to expected polynomial time algorithms and we’d be knee-deep in the kinds of heuristics used to pack container ships efficiently.

Their signal-to-noise ratio was low, so they decided to do a specification search, use a one-tailed test, and go with a p-value of 0.1.

Posted on March 13, 2024 9:08 AM by Andrew

Adam Zelizer writes:

I saw your post about the underpowered COVID survey experiment on the blog and wondered if you’ve seen this paper, “Counter-stereotypical Messaging and Partisan Cues: Moving the Needle on Vaccines in a Polarized U.S.” It is written by a strong team of economists and political scientists and finds large positive effects of Trump pro-vaccine messaging on vaccine uptake.

They find large positive effects of the messaging (administered through Youtube ads) on the number of vaccines administered at the county level—over 100 new vaccinations in treated counties—but only after changing their specification from the prespecified one in the PAP. The p-value from the main modified specification is only 0.097, from a one-tailed test, and the effect size from the modified specification is 10 times larger than what they get from the pre-specified model. The prespecified model finds that showing the Trump advertisement increased the number of vaccines administered in the average treated county by 10; the specification in the paper, and reported in the abstract, estimates 103 more vaccines. So moving from the specification in the PAP to the one in the paper doesn’t just improve precision, but it dramatically increases the estimated treatment effect. A good example of suppression effects.

They explain their logic for using the modified specification, but it smells like the garden of forking paths.

Here’s a snippet from the article:

I don’t have much to say about the forking paths except to give my usual advice to fit all reasonable specifications and use a hierarchical model, or at the very least do a multiverse analysis. No reason to think that the effect of this treatment should be zero, and if you really care about effect size you want to avoid obvious sources of bias such as model selection.

The above bit about one-tailed tests reflects a common misunderstanding in social science. As I’ll keep saying until my lips bleed, effects are never zero. They’re large in some settings, small in others, sometimes positive, sometimes negative. From the perspective of the researchers, the idea of the hypothesis test is to give convincing evidence that the treatment truly has a positive average effect. That’s fine, and it’s addressed directly through estimation: the uncertainty interval gives you a sense of what the data can tell you here.

When they say they’re doing a one-tailed test and they’re cool with a p-value of 0.1 (that would be 0.2 when following the standard approach) because they have “low signal-to-noise ratios” . . . that’s just wack. Low signal-to-noise ratio implies high uncertainty in your conclusions. High uncertainty is fine! You can still recommend this policy be done in the midst of this uncertainty. After all, policymakers have to do something. To me, this one-sided testing and p-value thresholding thing just seems to be missing the point, in that it’s trying to squeeze out an expression of near-certainty from data that don’t admit such an interpretation.

P.S. I do not write this sort of post out of any sort of animosity toward the authors or toward their topic of research. I write about these methods issues because I care. Policy is important. I don’t think it is good for policy for researchers to use statistical methods that lead to overconfidence and inappropriate impressions of certainty or near-certainty. The goal of a statistical analysis should not be to attain statistical significance or to otherwise reach some sort of success point. It should be to learn what we can from our data and model, and to also get a sense of what we don’t know..

Fully funded doctoral student positions in Finland

Posted on March 13, 2024 6:30 AM by Aki Vehtari

There is a new government funded Finnish Doctoral Program in AI. Research topics include Bayesian inference, modeling and workflows as part of fundamental AI. There is a big joint call, where you can choose the supervisor you want to work with. I (Aki) am also one of the supervisors. Come work with me or share the news! The first call deadline is April 2, and the second call deadline in fall 2024. See how to apply at https://fcai.fi/doctoral-program, and more about my research at my web page.

Zotero now features retraction notices

Posted on March 12, 2024 9:47 AM by Andrew

David Singerman writes:

Like a lot of other humanities and social sciences people I use Zotero to keep track of citations, create bibliographies, and even take & store notes. I also am not alone in using it in teaching, making it a required tool for undergraduates in my classes so they learn to think about organizing their information early on. And it has sharing features too, so classes can create group bibliographies that they can keep using after the semester ends.

Anyway my desktop client for Zotero updated itself today and when it relaunched I had a big red banner informing me that an article in my library had been retracted! I didn’t recognize it at first, but eventually realized that was because it was an article one of my students had added to their group library for a project.

The developers did a good job of making the alert unmissable (i.e. not like a corrections notice in a journal), the full item page contains lots of information and helpful links about the retraction, and there’s a big red X next to the listing in my library. See attached screenshots.

The way they implemented it will also help the teaching component, since a student will get this alert too.

Singerman adds this P.S.:

This has reminded me that some time ago you posted something about David Byrne, and whatever you said, it made me think of David Byrne’s wonderful appearance on the Colbert Report.

What was amazing to me when I saw it was that it’s kind of like a battle between Byrne’s inherent weirdness and sincerity, and Colbert’s satirical right-wing bloviator character. Usually Colbert’s character was strong enough to defeat all comers, but . . . decide for yourself.

Putting a price on vaccine hesitancy (Bayesian analysis of a conjoint experiment)

Posted on March 11, 2024 9:43 AM by Andrew

Tom Vladeck writes:

I thought you may be interested in some internal research my company did using a conjoint experiment, with analysis using Stan! The upshot is that we found that vaccine hesitant people would require a large payment to take the vaccine, and that there was a substantial difference between the prices required for J&J and Moderna & Pfizer (evidence that the pause was very damaging). You can see the model code here.

My reply: Cool! I recommend you remove the blank lines from your Stan code as that will make your program easier to read.

Vladeck responded:

I prefer a lot of vertical white space. But good to know that I’m likely in the minority there.

For me, it’s all about the real estate. White space can help code be more readable but it should be used sparingly. What I’d really like is a code editor that does half white spaces.

Refuted papers continue to be cited more than their failed replications: Can a new search engine be built that will fix this problem?

Posted on March 10, 2024 9:11 AM by Andrew

Paul von Hippel writes:

Stuart Buck noticed your recent post on A WestLaw for Science. This is something that Stuart and I started talking about last year, and Stuart, who trained as an attorney, believes it was first suggested by a law professor about 15 years ago.

Since the 19th century, the legal profession has had citation indices that do far more than count citations and match keywords. Resources like Shepard’s Citations—first printed in 1873 and now published online along with competing tools such as JustCite, KeyCite, BCite, and SmartCite—do not just find relevant cases and statutes; they show lawyers whether a case or statute is still “good law.” Legal citation indexes show lawyers which cases have been affirmed or cited approvingly, and which have been criticized, reversed, or overruled by later courts.

Although Shepard’s Citations inspired the first Science Citation Index in 1960, which in turn inspired tools like Google Scholar, today’s academic search engine still rely primarily on citation counts and keywords. As a result, many scientists are like lawyers who walk into the courtroom unaware that a case central to their argument has been overruled.

Kind of, but not quite. A key difference is that in the courtroom there is some reasonable chance that the opposing lawyer or the judge will notice that the key case has been overruled, so that your argument that hinges on that case will fail. You have a clear incentive to not rely on overruled cases. In science, however, there’s no opposing lawyer and no judge: you can build an entire career on studies that fail to replicate, and no problem at all, as long as you don’t pull any really ridiculous stunts.

Hippel continues:

Let me share a couple of relevant articles that we recently published.

One, titled “Is Psychological Science Self-Correcting?, reports that replication studies, whether successful or unsuccessful, rarely have much effect on citations to the studies being replicated. When a finding fails to replicate, most influential studies sail on, continuing to gather citations at a similar rate for years, as though the replication had never been tried. The issue is not limited to psychology and raises serious questions about how quickly the scientific community corrects itself, and whether replication studies are having the correcting influence that we would like them to have. I considered several possible reasons for the persistent influence on studies that failed to replicate, and concluded that academic search engines like Google Scholar may well be part of the problem, since they prioritize highly cited articles, replicable or not, perpetuating the influence of questionable findings.

The finding that replications don’t affect citations has itself replicated pretty well. A recent blog post by Bob Reed at the University of Canterbury, New Zealand, summarized five recent papers that showed more or less the same thing in psychology, economics, and Nature/Science publications.

In a second article, published just last week in Nature Human Behaviour, Stuart Buck and I suggest ways to Improve academic search engines to reduce scholars’ biases. We suggest that the next generation of academic search engines should do more than count citations, but should help scholars assess studies’ rigor and reliability. We also suggest that future engines should be transparent, responsive and open source.

This seems like a reasonable proposal. The good news is that it’s not necessary for their hypothetical new search engine to dominate or replace existing products. People can use Google Scholar to find the most cited papers and use this new thing to inform about rigor and reliability. A nudge in the right direction, you might say.

A new piranha paper

Posted on March 9, 2024 9:13 AM by Andrew

Kris Hardies points to this new article, Impossible Hypotheses and Effect-Size Limits, by Wijnand and Lennert van Tilburg, which states:

There are mathematical limits to the magnitudes that population effect sizes can take within the common multivariate context in which psychology is situated, and these limits can be far more restrictive than typically assumed. The implication is that some hypothesized or preregistered effect sizes may be impossible. At the same time, these restrictions offer a way of statistically triangulating the plausible range of unknown effect sizes.

This is closely related to our Piranha Principle, which we first formulated here and then followed up with this paper. It’s great to see more work being done in this area.

Statistical practice as scientific exploration

Posted on March 8, 2024 9:54 AM by Andrew

This was originally going to happen today, 8 Mar 2024, but it got postponed to some unspecified future date, I don’t know why. In the meantime, here’s the title and abstract:

Statistical practice as scientific exploration

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

Much has been written on the philosophy of statistics: How can noisy data, mediated by probabilistic models, inform our understanding of the world? After a brief review of that topic (in short, I am a Bayesian but not an inductivist), I discuss the ways in which researchers when using and developing statistical methods are acting as scientists, forming, evaluating, and elaborating provisional theories about the data and processes they are modeling. This perspective has the conceptual value of pointing toward ways that statistical theory can be expanded to incorporate aspects of workflow that were formally tacit or informal aspects of good practice, and the practical value of motivating tools for improved statistical workflow, as described in part in this article: http://www.stat.columbia.edu/~gelman/research/unpublished/Bayesian_Workflow_article.pdf

The whole thing is kind of mysterious to me. In the email invitation it was called the UPenn Philosophy of Computation and Data Workshop, but then they sent me a flyer where it was called the Philosophy of A.I., Data Science, & Society Workshop in the Quantitative Theory and Methods Department at Emory University. It was going to be on zoom so I guess the particular university affiliation didn’t matter.

In any case, the topic is important, and I’m always interested in speaking with people on the philosophy of statistics. So I hope they get around to rescheduling this one.

Relating t-statistics and the relative width of confidence intervals

Posted on March 7, 2024 4:20 PM by Dean Eckles

How much does a statistically significant estimate tell us quantitatively? If you have an estimate that’s statistically distinguishable from zero with some t-statistic, what does that say about your confidence interval?

Perhaps most simply, with a t-statistic of 2, your 95% confidence intervals will nearly touch 0. That is, they’re just about 100% wide in each direction. So they cover everything from nothing (0%) to around double your estimate (200%).

More generally, for a 95% confidence interval (CI), 1.96/t — or let’s say 2/t — gives the relative half-width of the CI. So for an estimate with t=4, then everything from half your estimate to 150% of your estimate is in the 95% CI.

For other commonly-used nominal coverage rates, the confidence intervals have a width that is less conducive to a rule of thumb, since the critical value isn’t something nice like ~2. (For example, with 99% CIs, the Gaussian critical value is 2.58.) Let’s look at 90, 95, and 99% confidence intervals for t = 1.96, 3, 4, 5, and 6:

You can see, for example, that even at t=5, the halved point estimate is still inside the 99% CI. Perhaps this helpfully highlights how much more precision you need to confidently state the size of an effect than just to reject the null.

These “relative” confidence intervals are just this smooth function of t (and thus the p-value), as displayed here:

It is only when the statistical evidence against the null is overwhelming — “six sigma” overwhelming or more —that you’re also getting tight confidence intervals in relative terms. Among other things, this highlights that if you need to use your estimates quantitatively, rather than just to reject the null, default power analysis is going to be overoptimistic.

A caveat: All of this just considers standard confidence intervals based on normal theory labeled by their nominal coverage. Of course, many p < 0.05 estimates may have been arrived at by wandering through a garden of forking paths, or precisely because it passed a statistical significance filter. Then these CIs are not going to conditionally have their advertised coverage.

Is the 2024 New York presidential primary really an “important election”?

Posted on March 7, 2024 4:12 PM by Andrew

This came in the email today:

Hmmmm . . . I dunno if it’s so accurate to describe the upcoming presidential primary as an “important election.” It’s pretty much uncontested for both parties, no?

With journals, it’s all about the wedding, never about the marriage.

Posted on March 7, 2024 9:45 AM by Andrew

John “not Jaws” Williams writes:

Here is another example of how hard it is to get erroneous publications corrected, this time from the climatology literature, and how poorly peer review can work.

From the linked article, by Gavin Schmidt:

Back in March 2022, Nicola Scafetta published a short paper in Geophysical Research Letters (GRL) . . . We (me, Gareth Jones and John Kennedy) wrote a note up within a couple of days pointing out how wrongheaded the reasoning was and how the results did not stand up to scrutiny. . . .

After some back and forth on how exactly this would work (including updating the GRL website to accept comments), we reformatted our note as a comment, and submitted it formally on December 12, 2022. We were assured from the editor-in-chief and publications manager that this would be a ‘streamlined’ and ‘timely’ review process. With respect to our comment, that appeared to be the case: It was reviewed, received minor comments, was resubmitted, and accepted on January 28, 2023. But there it sat for 7 months! . . .

The issue was that the GRL editors wanted to have both the comment and a reply appear together. However, the reply had to pass peer review as well, and that seems to have been a bit of a bottleneck. But while the reply wasn’t being accepted, our comment sat in limbo. Indeed, the situation inadvertently gives the criticized author(s) an effective delaying tactic since, as long as a reply is promised but not delivered, the comment doesn’t see the light of day. . . .

All in all, it took 17 months, two separate processes, and dozens of emails, who knows how much internal deliberation, for an official comment to get into the journal pointing issues that were obvious immediately the paper came out. . . .

The odd thing about how long this has taken is that the substance of the comment was produced extremely quickly (a few days) because the errors in the original paper were both commonplace and easily demonstrated. The time, instead, has been entirely taken up by the process itself. . . .

Schmidt also asks a good question:

Why bother? . . . Why do we need to correct the scientific record in formal ways when we have abundant blogs, PubPeer, and social media, to get the message out?

His answer:

Since journals remain extremely reluctant to point to third party commentary on their published papers, going through the journals’ own process seems like it’s the only way to get a comment or criticism noticed by the people who are reading the original article.

Good point. I’m glad that there are people like Schmidt and his collaborators who go to the trouble to correct the public record. I do this from time to time, but mostly I don’t like the stress of dealing with the journals so I’ll just post things here.

My reaction

This story did not surprise me. I’ve heard it a million times, and it’s often happened to me, which is why I once wrote an article called It’s too hard to publish criticisms and obtain data for replication.

Journal editors mostly hate to go back and revise anything. They’re doing volunteer work, and they’re usually in it because they want to publish new and exciting work. Replications, corrections, etc., that’s all seen as boooooring.

With journals, it’s all about the wedding, never about the marriage.

My NYU econ talk will be Thurs 18 Apr 12:30pm (NOT Thurs 7 Mar)

Posted on March 6, 2024 5:00 PM by Andrew

Hi all. The other day I announced a talk I’ll be giving at the NYU economics seminar. It will be Thurs 18 Apr 12:30pm at 19 West 4th St., room 517.

In my earlier post, I’d given the wrong day for the talk. I’d written that it was this Thurs, 7 Mar. That was wrong! Completely my fault here; I misread my own calendar.

So I hope nobody shows up to that room tomorrow! Thank you for your forbearance.

I hope to see youall on Thurs 18 Apr. Again, here’s the title and abstract:

How large is that treatment effect, really?

“Unbiased estimates” aren’t really unbiased, for a bunch of reasons, including aggregation, selection, extrapolation, and variation over time. Econometrics typically focus on causal identification, with this goal of estimating “the” effect. But we typically care about individual effects (not “Does the treatment work?” but “Where and when does it work?” and “Where and when does it hurt?”). Estimating individual effects is relevant not only for individuals but also for generalizing to the population. For example, how do you generalize from an A/B test performed on a sample right now to possible effects on a different population in the future? Thinking about variation and generalization can change how we design and analyze experiments and observational studies. We demonstrate with examples in social science and public health.

Defining optimal reliance on model predictions in AI-assisted decisions

Posted on March 6, 2024 12:31 PM by Jessica Hullman

This is Jessica. In a previous post I mentioned methodological problems with studies of AI-assisted decision-making, such as are used to evaluate different model explanation strategies. The typical study set-up gives people some decision task (e.g., Given the features of this defendant, decide whether to convict or release), has them make their decision, then gives them access to a model’s prediction, and observes if they change their mind. Studying this kind of AI-assisted decision task is of interest as organizations deploy predictive models to assist human decision-making in domains like medicine and criminal justice. Ideally, the human is able to use the model to improve the performance they’d get on their own or if the model was deployed without a human in the loop (referred to as complementarity).

The most frequently used definition of appropriate reliance is that if the person goes with the model prediction but it’s wrong, this is overreliance. If they don’t go with the model prediction but it’s right, this is labeled underreliance. Otherwise it is labeled appropriate reliance.

This definition is problematic for several reasons. One is that the AI might have a higher probability than the human of selecting the right action, but still end up being wrong. It doesn’t make sense to say the human made the wrong choice by following it in such cases. Because it’s based on post-hoc correctness, this approach confounds two sources of non-optimal human behavior: not accurately estimating the probability that the AI is correct versus not making the right choice of whether to go with the AI or not given one’s beliefs.

By scoring decisions in action space, it also equally penalizes not choosing the right action (which prediction to go with) in a scenario where the human and the AI have very similar probabilities of being correct and one where either the AI or human has a much higher probability of being correct. Nevertheless, there are many papers doing it this way, some with hundreds of citations.

In A Statistical Framework for Measuring AI Reliance, Ziyang Guo, Yifan Wu, Jason Hartline and I write:

Humans frequently make decisions with the aid of artificially intelligent (AI) systems. A common pattern is for the AI to recommend an action to the human who retains control over the final decision. Researchers have identified ensuring that a human has appropriate reliance on an AI as a critical component of achieving complementary performance. We argue that the current definition of appropriate reliance used in such research lacks formal statistical grounding and can lead to contradictions. We propose a formal definition of reliance, based on statistical decision theory, which separates the concepts of reliance as the probability the decision-maker follows the AI’s prediction from challenges a human may face in differentiating the signals and forming accurate beliefs about the situation. Our definition gives rise to a framework that can be used to guide the design and interpretation of studies on human-AI complementarity and reliance. Using recent AI-advised decision making studies from literature, we demonstrate how our framework can be used to separate the loss due to mis-reliance from the loss due to not accurately differentiating the signals. We evaluate these losses by comparing to a baseline and a benchmark for complementary performance defined by the expected payoff achieved by a rational agent facing the same decision task as the behavioral agents.

It’s a similar approach to our rational agent framework for data visualization, but here we assume a setup in which the decision-maker receives a signal consisting of the feature values for some instance, the AI’s prediction, the human’s prediction, and optionally some explanation of the AI decision. The decision-maker chooses which prediction to go with.

We can compute the upper bound or best attainable performance in such a study (rational benchmark) as the expected score of a rational decision-maker on a randomly drawn decision task. The rational decision-maker has prior knowledge of the data generating model (the joint distribution over the signal and ground truth state). Seeing the instance in a decision trial, they accurately perceive the signal, arrive at Bayesian posterior beliefs about the distribution of the payoff-relevant state, then choose the action that maximizes their expected utility over the posterior. We calculate this in payoff space as defined by the scoring rule, such that the cost of an error can vary in magnitude.

We can define the “value of rational complementation” for the decision-problem at hand by also defining the rational agent baseline: the expected performance of the rational decision-maker without access to the signal on a randomly chosen decision task from the experiment. Because it represents the score the rational agent would get if they could rely only on their prior beliefs about the data-generating model, the baseline is the expected score of a fixed strategy that always chooses the better of the human alone or the AI alone.

If designing or interpreting an experiment on AI reliance, the first thing we might want to do is look at how close to the benchmark the baseline is. We want to see a decent amount of room for the human-AI team to improve performance over the baseline, as in the image above. If the baseline is very close to the benchmark, it’s probably not worth adding the human.

Once we have run an experiment and observed how well people make these decisions, we can treat the value of complementation as a comparative unit for interpreting how much value adding the human contributes over making the decision with the baseline. We do this by normalizing the observed score within the range where the rational agent baseline is 0 and the rational agent benchmark is 1 and looking at where the observed human+AI performance lies. This also provides a useful sense of effect size when we are comparing different settings. For example, if we have two model explanation strategies A and B we compared in an experiment, we can calculate expected human performance on a randomly drawn decision trial under A and under B and measure the improvement by calculating (score_A − score_B)/value of complementation.

We can also decompose sources of error in study participants’ performance. To do this, we define a “mis-reliant” rational decision-maker benchmark, which is the expected score of a rational agent constrained to the reliance level that we observe in study participants. Hence this is the best score a decision-maker who relies on the AI the same overall proportion of the time could attain had they perfectly perceived the probability that the AI is correct relative to the probability that the human is correct on every decision task. Since the mis-reliant benchmark and the study participants have the same reliance level (i.e., they both accept the AI’s prediction the same percentage of the time), the difference in their decisions lies entirely in accepting the AI predictions at different instances. The mis-reliant rational decision-maker always accepts the top X% AI predictions ranked by performance advantage over human predictions, but study participants may not.

By calculating the mis-reliant rational benchmark for the observed reliance level of study participants, we can distinguish between reliance loss, the loss from over- or under-relying on the AI (defined as the difference between the rational benchmark and mis-reliant benchmark divided by the value of rational complementation), and discrimination loss, the loss from not accurately differentiating the instances where the AI is better than the human from the ones where the human is better than the AI (defined as the difference between the mis-reliant benchmark and the expected score of participants divided by the value of rational complementation).

We apply this approach to some well-known studies on AI reliance and can extend the original interpretations to varying degrees, ranging from observing a lack of potential to see complementarity in the study given how much better the AI was going in than the human, to the original interpretation missing that participants’ reliance levels were pretty close to rational and they just couldn’t distinguish which signals they should go with AI on. We also observe researchers making comparisons across conditions for which the upper and lower bounds on performance differ without accounting for the difference.

There’s more in the paper – for example, we discuss how the rational benchmark, our upper bound representing the expected score of a rational decision-maker on a randomly chosen decision task, may be overfit to the empirical data. This occurs when the signal space is very large (e.g., the instance is a text document) such that we observe very few human predictions per signal. We describe how the rational agent could determine the best response on the optimal coarsening of the empirical distribution, such that the true rational benchmark is bounded by this and the overfit upper bound.

While we focused on showing how to improve research on human-AI teams, I’m excited about the potential for this framework to help organizations as they consider whether deploying an AI is likely to improve some human decision process. We are currently thinking about what sorts of practical questions (beyond Could pairing a human and AI be effective here?) we can answer using such a framework.