“When will AI be able to do scientific research both cheaper and better than us, thus effectively obsoleting humans?”

Alexey Guzey asks:

How much have you thought about AI and when will AI be able to do scientific research both cheaper and better than us, thus effectively obsoleting humans?

My first reply: I guess that AI can already do better science than Matthew “Sleeplord” Walker, Brian “Pizzagate” Wansink, Marc “Schoolmarm” Hauser, or Satoshi “Freakonomics” Kanazawa. So some humans are already obsolete, when it comes to producing science.

OK, let me think a bit more. I guess it depends on what kind of scientific research we’re talking about. Lots of research can be automated, and I could easily imagine an AI that can do routine analysis of A/B tests better than a human could. Indeed, thinking of how the AI could do this is a good way to improve how humans currently do things.
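To make that concrete, here's a minimal sketch in Python (the function name and numbers are invented; this is not any particular system) of the kind of routine A/B-test summary that seems automatable: estimate the difference in conversion rates and attach a standard uncertainty interval.

import numpy as np
from scipy import stats

def analyze_ab_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    # simple two-proportion comparison: estimate, standard error, interval
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = stats.norm.ppf(1 - alpha / 2)
    return {"estimate": diff, "se": se, "interval": (diff - z * se, diff + z * se)}

print(analyze_ab_test(conv_a=480, n_a=10000, conv_b=540, n_b=10000))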

For bigger-picture research, I don’t see AI doing much. But a big problem now with human research is that human researchers want to take routine research and promote it as big-picture (see Walker, Wansink, Kanazawa, etc.). I guess that an AI could be programmed to do hype and create Ted talk scripts.

Guzey’s response:

What’s “routine research”? Would someone without a college degree be able to do it? Is routine research simply defined as such that can be done by a computer now?

My reply: I guess the computer couldn’t really do the research, as that would require filling test tubes or whatever. I’m thinking that the computer could set up the parameters of an experiment, evaluate measurements, choose sample size, write up the analysis, etc. It would have to be some computer program that someone writes. If you just fed the scientific literature into a chatbot, I guess you’d just get millions more crap papers, basically reproducing much of what is bad about the literature now, which is the creation of articles that give the appearance of originality and relevance while actually being empty in content.

But, now that I’m writing this, I think Guzey is asking something slightly different: he wants to know when a general purpose “scientist” computer could be written, kind of like a Roomba or a self-driving car, but instead of driving around, it would read the literature, perform some sort of sophisticated meta-analyses, and come up with research ideas, like “Run an experiment on 500 people testing manipulations A and B, measure pre-treatment variables U and V, and look at outcomes X and Y.” I guess the first step would be to try to build such a system in a narrow environment such as testing certain compounds that are intended to kill bacteria or whatever.

I don’t know. On one hand, even the narrow version of this problem sounds really hard; on the other hand, our standards for publishable research are so low that it doesn’t seem like it would be so difficult to write a computer program that can fake it.

Maybe the most promising area of computer-designed research would be in designing new algorithms, because there the computer could actually perform the experiment; no laboratory or test tubes required, so the experiments can be run automatically and the computer could try millions of different things.

The paradox of replication studies: A good analyst has special data analysis and interpretation skills. But it’s considered a bad or surprising thing if you give the same data to different analysts and they come to different conclusions.

Benjamin Kircup writes:

I think you will be very interested to see this preprint that is making the rounds: Same data, different analysts: variation in effect sizes due to analytical decisions in ecology and evolutionary biology (ecoevorxiv.org)

I see several ties to social science, including the study of how data interpretation varies across scientists studying complex systems; but also the sociology of science. This is a pretty deep introspection for a field; and possibly damning. The garden of forking paths is wide. They cite you first, which is perhaps a good sign.

Ecologists frequently pride themselves on data analysis and interpretation skills. If there weren’t any variability, what skill would there be? It would all be mechanistic, rote, unimaginative, uninteresting. In general, actually, that’s the perception many have of typical biostatistics. It leaves insights on the table by being terribly rote and using the most conservative kinds of analytic tools (yet another t-test, etc). The price of this is that different people will reach different conclusions with the same data – and that’s not typically discussed, but raises questions about the literature as a whole.

One point: apparently the peer reviews didn’t systematically reward finding large effect sizes. That’s perhaps counterintuitive and suggests that the community isn’t rewarding bias, at least in that dimension. It would be interesting to see what you would do with the data.

The first thing I noticed is that the paper has about a thousand authors! This sort of collaborative paper kind of breaks the whole scientific-authorship system.

I have two more serious thoughts:

1. Kircup makes a really interesting point, that analysts “pride themselves on data analysis and interpretation skills. If there weren’t any variability, what skill would there be?”, but then it’s considered a bad or surprising thing if you give the same data to different analysts and they come to different conclusions. There really does seem to be a fundamental paradox here. On one hand, different analysts do different things (Pete Palmer and Bill James have different styles, and you wouldn’t expect them to come to the same conclusions); on the other hand, we expect strong results to appear no matter who is analyzing the data.

A partial resolution to this paradox is that much of the skill of data analysis and interpretation comes in what questions to ask. In these replication projects (I think Bob Carpenter calls them “bake-offs”), several different teams are given the same question and the same data and then each do their separate analysis. David Rothschild and I did one of these; it was called We Gave Four Good Pollsters the Same Raw Data. They Had Four Different Results, and we were the only analysts of that Florida poll from 2016 that estimated Trump to be in the lead. Usually, though, data and questions are not fixed, despite what it might look like when you read the published paper. Still, there’s something intriguing about what we might call the Analyst’s Paradox.

2. Regarding his final bit (“apparently the peer reviews didn’t systematically reward finding large effect sizes”), I think Kircup is missing the point. Peer reviews don’t systematically reward finding large effect sizes. What they systematically reward is finding “statistically significant” effects, i.e., those that are at least two standard errors from zero. But by restricting yourself to those, you automatically overestimate effect sizes, as I discussed at interminable length in papers such as Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors and The failure of null hypothesis significance testing when studying incremental changes, and what to do about it. So they are rewarding bias, just indirectly.
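Here's a minimal simulation of that point (my own illustration, with made-up numbers): when the true effect is small relative to the standard error, the estimates that happen to clear the two-standard-error bar are systematically exaggerated.

import numpy as np

rng = np.random.default_rng(1)
true_effect, se = 0.1, 0.5                      # weak signal, noisy study
estimates = rng.normal(true_effect, se, size=1_000_000)
significant = np.abs(estimates) > 1.96 * se     # the "statistical significance" filter

print("mean of all estimates:       ", round(estimates.mean(), 3))
print("mean |estimate| if selected: ", round(np.abs(estimates[significant]).mean(), 3))
print("Type M exaggeration ratio:   ", round(np.abs(estimates[significant]).mean() / true_effect, 1))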

Progress in 2023, Leo edition

Following Andrew, Aki, Jessica, and Charles, and based on Andrew’s proposal, I list my research contributions for 2023.

Published:

  1. Egidi, L. (2023). Seconder of the vote of thanks to Narayanan, Kosmidis, and Dellaportas and contribution to the Discussion of ‘Flexible marked spatio-temporal point processes with applications to event sequences from association football’. Journal of the Royal Statistical Society Series C: Applied Statistics, 72(5), 1129.
  2. Marzi, G., Balzano, M., Egidi, L., & Magrini, A. (2023). CLC Estimator: a tool for latent construct estimation via congeneric approaches in survey research. Multivariate Behavioral Research, 58(6), 1160-1164.
  3. Egidi, L., Pauli, F., Torelli, N., & Zaccarin, S. (2023). Clustering spatial networks through latent mixture models. Journal of the Royal Statistical Society Series A: Statistics in Society, 186(1), 137-156.
  4. Egidi, L., & Ntzoufras, I. (2023). Predictive Bayes factors. In SEAS IN. Book of short papers 2023 (pp. 929-934). Pearson.
  5. Macrì Demartino, R., Egidi, L., & Torelli, N. (2023). Power priors elicitation through Bayes factors. In SEAS IN. Book of short papers 2023 (pp. 923-928). Pearson.

Preprints:

  1. Consonni, G., & Egidi, L. (2023). Assessing replication success via skeptical mixture priors. arXiv preprint arXiv:2401.00257. Submitted.

Software:

  • CLC estimator: free and open-source app to estimate latent unidimensional constructs via congeneric approaches in survey research (Marzi et al., 2023)

  • footBayes package (CRAN version 0.2.0)

  • pivmet package (CRAN version 0.5.0)

I hope and guess that the paper dealing with the replication crisis, “Assessing replication success via skeptical mixture priors” with Guido Consonni, could have good potential in the Bayesian assessment of replication success in the social and hard sciences; this paper can be seen as an extension of the paper written by Leonhard Held and Samuel Pawel entitled “The Sceptical Bayes Factor for the Assessment of Replication Success.” Moreover, I am glad that the paper “Clustering spatial networks through latent mixture models,” focused on a model-based clustering approach defined in a hybrid latent space, has finally been published in JRSS A.

Regarding software, the footBayes package, a tool to fit the most well-known soccer (football) models through Stan and maximum likelihood methods, has been substantially developed and enriched with new functionality (2024 objective: incorporate CmdStan with VI/Pathfinder algorithms and write a paper about the package in JSS/R Journal format).

Learning from mistakes (my online talk for the American Statistical Association, 2:30pm Tues 30 Jan 2024)

Here’s the link:

Learning from mistakes

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

We learn so much from mistakes! How can we structure our workflow so that we can learn from mistakes more effectively? I will discuss a bunch of examples where I have learned from mistakes, including data problems, coding mishaps, errors in mathematics, and conceptual errors in theory and applications. I will also discuss situations where researchers have avoided good learning opportunities. We can then try to use all these cases to develop some general understanding of how and when we learn from errors in the context of the fractal nature of scientific revolutions.

The video is here.

It’s sooooo frustrating when people get things wrong, the mistake is explained to them, and they still don’t make the correction or take the opportunity to learn from their mistakes.

To put it another way . . . when you find out you made a mistake, you learn three things:

1. Now: Your original statement was wrong.

2. Implications for the future: Beliefs and actions that flow from that original statement may be wrong. You should investigate your reasoning going forward and adjust to account for your error.

3. Implications for the past: Something in your existing workflow led to your error. You should trace your workflow, see how that happened, and alter your workflow accordingly.

In poker, they say to evaluate the strategy, not the play. In quality control, they say to evaluate the process, not the individual outcome. Similarly with workflow.

As we’ve discussed many many times in this space (for example, here), it makes me want to screeeeeeeeeeam when people forego this opportunity to learn. Why do people, sometimes very accomplished people, give up this opportunity? I’m speaking here of people who are trying their best, not hacks and self-promoters.

The simple answer for why even honest people will avoid admitting clear mistakes is that it’s embarrassing for them to admit error, they don’t want to lose face.

The longer answer, I’m afraid, is that at some level they recognize issues 1, 2, and 3 above, and they go to some effort to avoid confronting item 1 because they really really don’t want to face item 2 (their beliefs and actions might be affected, and they don’t want to hear that!) and item 3 (they might be going about everything all wrong, and they don’t want to hear that either!).

So, paradoxically, the very benefits of learning from error are scary enough to some people that they’ll deny or bury their own mistakes. Again, I’m speaking here of otherwise-sincere people, not of people who are willing to lie to protect their investment or make some political point or whatever.

In my talk, I’ll focus on my own mistakes, not those of others. My goal is for you in the audience to learn how to improve your own workflow so you can catch errors faster and learn more from them, in all three senses listed above.

P.S. Planning a talk can be good for my research workflow. I’ll get invited to speak somewhere, then I’ll write a title and abstract that seems like it should work for that audience, then the existence of this structure gives me a chance to think about what to say. For example, I’d never quite thought of the three ways of learning from error until writing this post, which in turn was motivated by the talk coming up. I like this framework. I’m not claiming it’s new—I guess it’s in Pólya somewhere—just that it will help my workflow. Here’s another recent example of how the act of preparing an abstract helped me think about a topic of continuing interest to me.

“My view is that if I can show that a result was cooked and that doing it correctly does not yield the answer the authors claimed, then the result is discredited. . . . What I hear, instead, is the following . . .”

Economic historian Tim Guinnane writes:

I have a general question that I have not seen addressed on your blog. Often this question turns into a narrow question about retracting papers, but I think that short-circuits an important discussion.

Like many in economic history, I am increasingly worried that much research in recent years reflects p-hacking, misrepresentation of the history, useless data, and other issues. I realize that the technical/statistical issues differ from paper to paper.

What I see is something like the following. You can use this paper as a concrete example, but the problems are much more widespread. We document a series of bad research practices. The authors played games with controls to get the “right” answer for the variable of interest. (See Table 1 of the paper). In the text they misrepresent the definitions of variables used in regressions; we show that if you use the stated definition, their results disappear. They use the wrong degrees of freedom to compute error bounds (in this case, they had to program the bounds by hand, since Stata automatically uses the right df). There are other and to our minds more serious problems involved in selectively dropping data, claiming sources do not exist, etc.

Step back from any particular problem. How should the profession think about claims such as ours? My view is that if I can show that a result was cooked and that doing it correctly does not yield the answer the authors claimed, then the result is discredited. The journals may not want to retract such work, but there should be support for publishing articles that point out such problems.

What I hear, instead, is the following. A paper estimates beta as .05 with a given SE. Even if we show that this is cooked—that is, that beta is a lot smaller or the SE a lot larger if you do not throw in extraneous regressors, or play games with variable definitions—then ours is not really a result. It is instead, I am told, incumbent on the critic to start with beta=.05 as the null, and show that doing things correctly rejects that null in favor of something less than .05 (it is characteristic of most of this work that there really is no economic theory, so the null is always “X does not matter” which boils down to “this beta is zero.” And very few even tell us whether the correct test is one- or two-sided).

This pushback strikes me as weaponizing the idea of frequentist hypothesis testing. To my mind, if I can show that beta=.05 comes from a cooked regression, then we need to start over. That estimate can be ignored; it is just one of many incorrect estimates one can generate by doing things inappropriately. It actually gives the unscrupulous an incentive to concoct more outlandish betas which are then harder to reject. More generally, it puts a strange burden of proof on critics. I have discussed this issue with some folks in natural sciences who find the pushback extremely difficult to understand. They note what I think is the truth: it encourages bad research behavior by suppressing papers that demonstrate that bad behavior.

It might be opportune to have a general discussion of these sorts of issues on your website. The Gino case raises something much simpler, I think. I fear that it will in some ways lower the bar: so long as someone is not actively making up their data (which I realize has not been proven, in case this email gets subpoenaed!) then we do not need to worry about cooking results.

My reply: You raise several issues that we’ve discussed on occasion (for some links, see here):

1. The “Research Incumbency Rule”: Once an article is published in some approved venue, it is taken as truth. Criticisms which would absolutely derail a submission in pre-publication review can be brushed aside if they are presented after publication. This is what you call “the burden of proof on critics.”

2. Garden of forking paths.

3. Honesty and transparency are not enough. Work can be non-fraudulent but still be crap.

4. “Passive corruption” when people know there’s bad work but they don’t do anything about it.

5. A disturbingly casual attitude toward measurement; see here for an example: https://statmodeling.stat.columbia.edu/2023/10/05/no-this-paper-on-strip-clubs-and-sex-crimes-was-never-gonna-get-retracted-also-a-reminder-of-the-importance-of-data-quality-and-a-reflection-on-why-researchers-often-think-its-just-fine-to-publ/ Many economists and others seem to have been brainwashed into thinking that it’s ok to have bad measurement because attenuation bla bla . . . They’re wrong.

He responded: If you want an example of economists using stunningly bad data and making noises about attenuation, see here.

The paper in question has the straightforward title, “We Do Not Know the Population of Every Country in the World for the Past Two Thousand Years.”

Michael Wiebe has several new replications written up on his site.

Michael Wiebe writes:

I have several new replications written up on my site.

Moretti (2021) studies whether larger cities drive more innovation, but I find that the event study and instrumental variable results are due to coding errors. This means that the main OLS results should not be interpreted causally.

Atwood (2022) studies the long-term economic effects of the measles vaccine. I run an event study and find that the results are explained by trends, instead of a treatment effect of the vaccine.

I [Wiebe] am also launching a Patreon, so that I can work on replications full-time.

Interesting. We’ve discussed some of Wiebe’s investigations and questions in the past; see here, here, here, and here (on the topics of promotion in China, election forecasting, historical patents, and forking paths, respectively). So, good to hear that he’s still at it!

Regarding the use of “common sense” when evaluating research claims

I’ve often appealed to “common sense” or “face validity” when considering unusual research claims. For example, the statement that single women during certain times of the month were 20 percentage points more likely to support Barack Obama, or the claim that losing an election for governor increases politicians’ lifespan by 5-10 years on average, or the claim that a subliminal smiley face flashed on a computer screen causes large changes in people’s attitudes on immigration, or the claim that attractive parents are 36% more likely to have girl babies . . . these claims violated common sense. Or, to put it another way, they violated my general understanding of voting, health, political attitudes, and human reproduction.

I often appeal to common sense, but that doesn’t mean that I think common sense is always correct or that we should defer to common sense. Rather, common sense represents some approximation of a prior distribution or existing model of the world. When our inferences contradict our expectations, that is noteworthy (in a chapter 6 of BDA sort of way), and we want to address this. It could be that addressing this will result in a revision of “common sense.” That’s fine, but if we do decide that our common sense was mistaken, I think we should make that statement explicitly. What bothers me is when people report findings that contradict common sense and don’t address the revision in understanding that would be required to accept that.

In each of the above-cited examples (all discussed at various times on this blog), there was a much more convincing alternative explanation for the claimed results, given some mixture of statistical errors and selection bias (p-hacking or forking paths). That’s not to say the claims are wrong (Who knows?? All things are possible!), but it does tell us that we don’t need to abandon our prior understanding of these things. If we want to abandon our earlier common-sense views, that would be a choice to be made, an affirmative statement that those earlier views are held so weakly that they can be toppled by little if any statistical evidence.
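One way to see what I mean by common sense as an approximate prior is a little normal-normal calculation (a sketch with invented numbers, not a real analysis): a noisy estimate of a 20-percentage-point effect barely moves a skeptical prior centered near zero, but it completely determines the answer if the prior is nearly flat.

import numpy as np

def posterior(prior_mean, prior_sd, estimate, est_se):
    # precision-weighted combination of prior and data (normal-normal model)
    prior_prec, data_prec = 1 / prior_sd**2, 1 / est_se**2
    post_var = 1 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * estimate)
    return post_mean, np.sqrt(post_var)

print(posterior(prior_mean=0, prior_sd=2, estimate=20, est_se=8))    # skeptical prior: posterior stays near zero
print(posterior(prior_mean=0, prior_sd=100, estimate=20, est_se=8))  # nearly flat prior: posterior follows the claim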

P.S. Perhaps relevant is this recent article by Mark Whiting and Duncan Watts, “A framework for quantifying individual and collective common sense.”

Storytelling and Scientific Understanding (my talks with Thomas Basbøll at Johns Hopkins on 26 Apr)

Storytelling and Scientific Understanding

Andrew Gelman and Thomas Basbøll

Storytelling is central to science, not just as a tool for broadcasting scientific findings to the outside world, but also as a way that we as scientists understand and evaluate theories. We argue that, for this purpose, a story should be anomalous and immutable; that is, it should be surprising, representing some aspect of reality that is not well explained by existing models of the world, and have details that stand up to scrutiny.

We consider how this idea illuminates some famous stories in social science involving soldiers in the Alps, Chinese boatmen, and trench warfare, and we show how it helps answer literary puzzles such as why Dickens had all those coincidences, why authors are often so surprised by what their characters come up with, and why the best alternative history stories have the feature that, in these stories, our “real world” ends up as the deeper truth. We also discuss connections to chatbots and human reasoning, stylized facts and puzzles in science, and the millionth digit of pi.

At the center of our framework is a paradox: learning from anomalies seems to contradict usual principles of science and statistics where we seek representative or unbiased samples. We resolve this paradox by placing learning-within-stories into a hypothetico-deductive (Popperian) framework, in which storytelling is a form of exploration of the implications of a hypothesis. This has direct implications for our work as a statistician and a writing coach.

Progress in 2023, Jessica Edition

Since Aki and Andrew are doing it… 

Published:

Unpublished/Preprints:

Performed:

If I had to choose a favorite (beyond the play, of course) it would be the rational agent benchmark paper, discussed here. But I also really like the causal quartets paper. The first aims to increase what we learn from experiments in empirical visualization and HCI through comparison to decision-theoretic benchmarks. The second aims to get people to think twice about what they’ve learned from an average treatment effect. Both have influenced what I’ve worked on since.

What’s up with spring blooming?

 

This post is by Lizzie.

Here’s another media hit I missed; I was asked to discuss why daffodils are blooming now in January. If I could have replied I would have said something like:

(1) Vancouver is a weird mix of cool and mild for a temperate place — so we think plants accumulate their chilling (cool-ish winter temperatures needed before plants can respond to warm temperatures, but just cool — like 2-6 C is a supposed sweet spot) quickly and then a warm snap means they get that warmth they need and they start growing.

This is especially true for plants from other places that likely are not evolved for Vancouver’s climate, like daffodils.

(2) It’s been pretty warm! I bet they flowered because it has been so warm.

Deep insights, I know …. They missed me but luckily they got my colleague Doug Justice to speak and he hit my points. Doug knows plants more than I do. He also calls our cherry timing for our …

International Cherry Prediction Competition

Which is happening again this year!

You should compete! Why? You can win money, and you can help us build better models, because here’s what I would not say on TV:

We all talk about ‘chilling’ and ‘forcing’ in plants, and what we don’t tell you is that we never actually measure the physiological transition between chilling and forcing because… we aren’t sure what it is! Almost all chilling-forcing models are built on scant data where some peaches (mostly) did not bloom when they were planted in warm places 50+ years ago. We need your help!
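For anyone curious what that chilling/forcing bookkeeping looks like, here is a toy version (my own sketch, not any published model; the window and thresholds are invented): accumulate chill units while temperatures sit in the cool window, then accumulate warmth once the chilling requirement is met, and call bloom when the forcing total crosses a threshold.

def predict_bloom_hour(hourly_temps_c, chill_req=800, forcing_req=3000,
                       chill_window=(2.0, 6.0), forcing_base=5.0):
    chill, forcing = 0.0, 0.0
    for hour, t in enumerate(hourly_temps_c):
        if chill < chill_req:
            if chill_window[0] <= t <= chill_window[1]:
                chill += 1                          # one chill unit per qualifying hour
        else:
            forcing += max(0.0, t - forcing_base)   # growing degree-hours after chilling is met
            if forcing >= forcing_req:
                return hour                         # predicted bloom time
    return None                                     # requirements never met in this series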

 

The immediate victims of the con would rather act as if the con never happened. Instead, they’re mad at the outsiders who showed them that they were being fooled.

Dorothy Bishop has the story about “a chemistry lab in CNRS-Université Sorbonne Paris Nord”:

More than 20 scientific articles from the lab of one principal investigator have been shown to contain recycled and doctored graphs and electron microscopy images. That is, results from different experiments that should have distinctive results are illustrated by identical figures, with changes made to the axis legends by copying and pasting numbers on top of previous numbers. . . . the problematic data are well-documented in a number of PubPeer comments on the articles (see links in Appendix 1 of this document).

The response by CNRS [Centre National de la Recherche Scientifique] to this case . . . was to request correction rather than retraction of what were described as “shortcomings and errors”, to accept the scientist’s account that there was no intentionality, despite clear evidence of a remarkable amount of manipulation and reuse of figures; a disciplinary sanction of exclusion from duties was imposed for just one month.

I’m not surprised. The sorts of people who will cheat on their research are likely to be the same sorts of people who will instigate lawsuits, start media campaigns, and attack in other ways. These are researchers who’ve already shown a lack of scruple and a willingness to risk their careers; in short, they’re loose cannons, scary people, so it can seem like the safest strategy to not try to upset them too much, not trap them into a corner where they’ll fight like trapped rats. I’m not speaking specifically of this CNRS researcher—I know nothing of the facts of this case beyond what’s reported in Bishop’s post—I’m just speaking to the mindset of the academic administrators who would just like the problem to go away so they can get on with their regular jobs.

But Bishop and her colleagues were annoyed. If even blatant examples of scientific misconduct cannot be handled straightforwardly, what does this say about the academic and scientific process more generally? Is science just a form of social media, where people can make any sort of claim and evidence doesn’t matter?

They write:

So what should happen when fraud is suspected? We propose that there should be a prompt investigation, with all results transparently reported. Where there are serious errors in the scientific record, then the research articles should immediately be retracted, any research funding used for fraudulent research should be returned to the funder, and the person responsible for the fraud should not be allowed to run a research lab or supervise students. The whistleblower should be protected from repercussions.

In practice, this seldom happens. Instead, we typically see, as in this case, prolonged and secret investigations by institutions, journals and/or funders. There is a strong bias to minimize the severity of malpractice, and to recommend that published work be “corrected” rather than retracted.

Bishop and her colleagues continue:

One can see why this happens. First, all of those concerned are reluctant to believe that researchers are dishonest, and are more willing to assume that the concerns have been exaggerated. It is easy to dismiss whistleblowers as deluded, overzealous or jealous of another’s success. Second, there are concerns about reputational risk to an institution if accounts of fraudulent research are publicised. And third, there is a genuine risk of litigation from those who are accused of data manipulation. So in practice, research misconduct tends to be played down.

But:

This failure to act effectively has serious consequences:

1. It gives credibility to fictitious results, slowing down the progress of science by encouraging others to pursue false leads. . . . [and] erroneous data pollutes the databases on which we depend.

2. Where the research has potential for clinical or commercial application, there can be direct damage to patients or businesses.

3. It allows those who are prepared to cheat to compete with other scientists to gain positions of influence, and so perpetuate further misconduct, while damaging the prospects of honest scientists who obtain less striking results.

4. It is particularly destructive when data manipulation involves the Principal Investigator of a lab. . . . CNRS has a mission to support research training: it is hard to see how this can be achieved if trainees are placed in a lab where misconduct occurs.

5. It wastes public money from research grants.

6. It damages public trust in science and trust between scientists.

7. It damages the reputation of the institutions, funders, journals and publishers associated with the fraudulent work.

8. Whistleblowers, who should be praised by their institution for doing the right thing, are often made to feel that they are somehow letting the side down by drawing attention to something unpleasant. . . .

What happened next?

It’s the usual bad stuff. They receive a series of stuffy bureaucratic responses, none of which address any of items 1 through 8 above, let alone the problem of the data which apparently have obviously been faked. Just disgusting.

But I’m not surprised. We’ve seen it many times before:

– The University of California’s unresponsive response when informed of research misconduct by their star sleep expert.

– The American Political Science Association refusing to retract an award given to an author for a book with plagiarized material, or even to retroactively have the award shared with the people whose material was copied without acknowledgment.

– The London Times never acknowledging the blatant and repeated plagiarism by its celebrity chess columnist.

– The American Statistical Association refusing to retract an award given to a professor who plagiarized multiple times, including from wikipedia (in an amusing case where he created negative value by introducing an error into the material he’d copied, so damn lazy that he couldn’t even be bothered to proofread his pasted material).

– Cornell University . . . ok they finally canned the pizzagate dude, but only after emitting some platitudes. Kind of amazing that they actually moved on that one.

– The Association for Psychological Science: this one’s personal for me, as they ran an article that flat-out lied about me and then refused to correct it just because, hey, they didn’t want to.

– Lots and lots of examples of people finding errors or fraud in published papers and journals refusing to run retractions or corrections or even to publish letters pointing out what went wrong.

Anyway, this is one more story.

What gets my goat

What really annoys me in these situations is how the institutions show loyalty to the people who did research misconduct. When researcher X works at or publishes with institution Y, and it turns out that X did something wrong, why does Y so often try to bury the problem and attack the messenger? Y should be mad at X; after all, it’s X who has leveraged the reputation of Y for his personal gain. I’d think that the leaders of Y would be really angry at X, even angrier than people from the outside. But it doesn’t happen that way. The immediate victims of the con would rather act as if the con never happened. Instead, they’re mad at the outsiders who showed them that they were being fooled. I’m sure that Dan Davies would have something to say about all this.

In some cases academic misconduct doesn’t deserve a public apology

This is Jessica. As many of you probably saw, Claudine Gay resigned as president of Harvard this week. Her tenure as president is apparently the shortest on record, and accusations of plagiarism involving some of her published papers and her dissertation seem to be a major contributor that pushed this decision, after the initial backlash against Gay’s response alongside MIT and Penn presidents Kornbluth and Magill to questions from Republican congresswoman Stefanik about blatantly anti-semitic remarks on their campuses in the wake of Oct. 7.

The plagiarism counts are embarrassing for Gay and for Harvard, for sure, as were the very legalistic reactions of all three presidents when asked about anti-semitism on their campuses. In terms of plagiarism as a phenomenon that crops up in academia, I agree with Andrew that it tells us something about the author’s lack of ability or effort to take the time to understand the material. I suspect it happens a lot under the radar, and I see it as a professor (more often with ChatGPT in the mix and no, it does not always lead to explicit punishment, to comment on what some are saying online about double standards for faculty and students). What I don’t understand is how in Gay’s case this is misconduct at the level that warrants a number of headline stories in major mainstream news media and the resignation of an administrator who has put aside her research career anyway.

On the one hand, I can see how it is temptingly easy to rationalize why the president of what is probably the most revered university on earth cannot be associated with any academic misconduct without somehow bringing shame on the institution. She’s the president of Harvard, how can it not be shocking?! is one narrative I suppose. But, this kind of response to this situation is exactly what bothers me in the wake of her resignation. I will try to explain.

Regarding the specifics, I saw a few of the plagiarized passages early on, and I didn’t see much reason to invest my time in digging further, if this was the best that could be produced by those who were obviously upset about it (I agree with Phil here that they seem like a “weak” form of plagiarism). What makes me uncomfortable about this situation was how so many people, under the guise of being “objective,” did feel the need to invest their time in the name of establishing some kind of truth in the situation. This is the moment of decision that I wish to call attention to. It’s as though in the name of being “neutral” and “evidence based” we are absolved from having to consider why we feel so compelled in certain cases to get to the bottom of it, but not so much in other cases.

It’s the same thing that makes so much research bad: the inability to break frame, to turn on the premise rather than the minor details. To ask, how did we get here? Why are we all taking for granted that this is the thing to be concerned with? 

Situations like what happened to Gay bring a strong sense of deja vu for me. I’m not sure how much my personal reaction is related to being female in a still largely male-dominated field myself, but I suspect it contributes. There’s a scenario that plays out from time to time where someone who is not in the majority in some academic enterprise is found to have messed up. At first glance, it seems fairly minor, somewhat relatable at least, no worse than what many others have done. But, somehow, it can’t be forgotten in some cases. Everyone suddenly exerts effort they would normally have trouble producing for a situation that doesn’t concern them that much personally to pore over the details with a fine-tooth comb to establish that there really was some fatal flaw here. The discussion goes on and becomes hard to shut out, because there is always someone else who is somehow personally offended by it. And the more it gets discussed, the more it seems like overwhelmingly a real thing to be dealt with, to be decided. It becomes an example for the sake of being principled. Once this palpable sense that ‘this is important’, ‘this is a message about our principles,’ sets in, then the details cannot be overlooked. How else can we be sure we are being rational and objective? We have to treat it like evidence and bring to bear everything we know about scrutinizing evidence.

What is hard for me to get over is that these stories that stick around and capture so much attention are far more often stories about some member of the racial or gender non-majority who ended up in a high place. It’s like the resentment that a person from the outside has gotten in sets in without the resenter even becoming aware of it, and suddenly a situation that seems like it should have been cooperative gets much more complicated. This is not to say that people who are in the majority in a field don’t get called out or targeted sometimes, they do. Just that there’s a certain dynamic that seems to set in more readily when someone perceived as not belonging to begin with messes up. As Jamelle Watson-Daniels writes on X/Twitter of the Gay situation: “the legacy and tradition of orchestrated attacks against the credibility of Black scholars all in the name of haunting down and exposing them as… the ultimate imposters.” This is the undertone I’m talking about here.

I’ve been a professor for about 10 years, and I’ve seen this sort of hyper-attention turned on women and/or others in the non-majority who violated some minor code repeatedly in that time. In many instances, it creates a situation that divides those who are confused by the apparent level of detail-orientedness given the crime and those who can’t see how there is any other way than to make the incident into an example. Gay is just the most recent reminder.

What makes this challenging for me to write about personally is that I am a big believer in public critique, and admitting one’s mistakes. I have advocated for both on this blog. To take an example that comes up from time to time, I don’t think that because of uneven power dynamics, public critique of papers with lead student authors should be shut down, or that we owe authors extensive private communications before we openly criticize. That goes against the sort of open discussion of research flaws that we are already often incentivized to avoid. For the same reason, I don’t think that critiques made by people with ulterior motives should be dismissed. I think there were undoubtedly ulterior motives here, and I am not arguing that the information about accounts of plagiarism here should not have been shared at all. 

I also think making decisions driven by social values (which often comes up under the guise of DEI) is very complex. At least in academic computer science, we seem to be experiencing a moment of heightened sensitivity to what is perceived as “moral” and “ethical”, and often these things are defined very simplistically and tolerance for disagreement is low.

And I also think that there are situations where a transgression may seem minor but it is valuable to mind all the details and use it as an example! I was surprised for example at how little interest there seemed to be in the recent Nature Human Behavior paper which claimed to present all confirmatory analyses but couldn’t produce the evidence that the premise of the paper suggests should be readily available. This seemed to me like an important teachable moment given what the paper was advocating to begin with.  

So anyway, lots of reasons why this is hard to write about, and lots of fodder for calling me a hypocrite if you want. But I’m writing this post because the plagiarism is clearly not the complete story here. I don’t know the full details of the Gay investigation (and admit I haven’t spent too much time researching this: I’ve seen a bunch of the plagiarism examples, but I don’t have a lot of context on her entire career). So it’s possible I’m wrong and she did some things that were truly more awful than the average Harvard president. But I haven’t heard about them yet. And either way my point still stands: there are situations with similar dynamics to this where my dedication to scientific integrity and public critique and getting to the bottom of technical details do not disappear, but are put on the backburner to question a bigger power dynamic that seems off.

And so, while normally I think everyone caught doing academic misconduct should acknowledge it, for the reasons above, at least at the moment, it doesn’t bother me that Gay’s resignation letter doesn’t mention the plagiarism. I think not acknowledging it was the right thing to do.

Progress in 2023

Published:

Unpublished:

Enjoy.

Clarke’s Law, and who’s to blame for bad science reporting

Lizzie blamed the news media for a horrible bit of news reporting on the ridiculous claim that “the climate crisis is causing certain grapes, used in almost all champagne, to be on the brink of extinction.” The press got conned by a press release from a sleazy company, which in this case was “a Silicon Valley startup” but in other settings could be a pollster or a car company or a university public relations office or an advocacy group or some other institution that has a quasi-official role in our society.

Lizzie was rightly ticked off by the media organizations that were happily playing the “sucker” role in this drama, with CNN straight-up going with the press release, along with a fawning treatment of the company that was pushing the story, and NPR going with a mildly skeptical amused tone, interviewing an actual outside expert but still making the mistake of taking the story seriously rather than framing it as a marketing exercise.

We’ve seen this sort of credulous reporting before, perhaps most notably with Theranos and the hyperloop. It’s not just that the news media are suckers, it’s that being a sucker—being credulous—is in many cases a positive for a journalist. A skeptical reporter will run fewer stories, right? Malcolm Gladwell and the Freakonomics team are superstars, in part because they’re willing to routinely turn off whatever b.s. detectors they might have, in order to tell good stories. They get rewarded for their practice of promoting unfounded claims. If we were to imagine an agent-based model of the news media, these are the agents that flow to the top. One could suppose a different model, in which mistakes tank your reputation, but that doesn’t seem to be the world in which we operate.

So, yeah, let’s get mad at the media, first for this bogus champagne story and second for using this as an excuse to promote a bogus company.

Also . . .

Let’s get mad at the institutions of academic science, which for years have been unapologetically promoting crap like himmicanes, air rage, ages ending in 9, nudges, and, let’s never forget, the lucky golf ball.

In terms of wasting money and resources, I don’t think any of those are as consequential as business scams such as Theranos or hyperloop; rather, they bother me because they’re coming from academic science, which might traditionally be considered a more trustworthy source.

And this brings us to Clarke’s law, which you may recall is the principle that any sufficiently crappy research is indistinguishable from fraud.

How does that apply here? I can only assume that the researchers behind the studies of himmicanes, air rage, ages ending in 9, nudges, the lucky golf ball, and all the rest, are sincere and really believe that their claims are supported by their data. But there have been lots of failed replications, along with methodological and statistical explanations of what went wrong in those studies. At some point, to continue to promote them is, in my opinion, on the border of fraud: it requires willfully looking away from contrary evidence and, at the extreme, leads to puffed-up-rooster claims such as, “The replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.”

In short, the corruption involved in the promotion of academic science has poisoned the well and facilitated the continuing corruption of the news media by business hype.

I’m not saying that business hype and media failure are the fault of academic scientists. Companies would be promoting themselves, and these lazy news organizations would be running glorified press releases, no matter what we were to do in academia. Nor, for that matter, are academics responsible for credulity on stories such as UFO space aliens. The elite news media seems to be able to do this all on its own.

I just don’t think that academic science hype is helping with the situation. Academic science hype helps to set up the credulous atmosphere.

Michael Joyner made a similar point a few years ago:

Why was the Theranos pitch so believable in the first place? . . .

Who can forget when James Watson. . . . co-discoverer of the DNA double helix, made a prediction in 1998 to the New York Times that so-called VEGF inhibitors would cure cancer in “two years”?

At the announcement of the White House Human Genome Project in June 2000, both President Bill Clinton and biotechnologist Craig Venter predicted that cancer would be vanquished in a generation or two. . . .

That was followed in 2005 by the head of the National Cancer Institute, Andrew von Eschenbach, predicting the end of “suffering and death” from cancer by 2015, based on a buzzword bingo combination of genomics, informatics, and targeted therapy.

Verily, the life sciences arm of Google, generated a promotional video that has, shall we say, some interesting parallels to the 2014 TedMed talk given by Elizabeth Holmes. And just a few days ago, a report in the New York Times on the continuing medical records mess in the U.S. suggested that with better data mining of more coherent medical records, new “cures” for cancer would emerge. . . .

So, why was the story of Theranos so believable in the first place? In addition to the specific mix of greed, bad corporate governance, and too much “next” Steve Jobs, Theranos thrived in a biomedical innovation world that has become prisoner to a seemingly endless supply of hype.

Joyner also noted that science hype was following patterns of tech hype. For example, this from Dr. Eric Topol, director of the Scripps Translational Science Institute:

When Theranos tells the story about what the technology is, that will be a welcome thing in the medical community. . . . I tend to believe that Theranos is a threat.

The Scripps Translational Science Institute is an academic, or at least quasi-academic, institution! But they’re using tech hype disrupter terminology by calling scam company Theranos a “threat” to the existing order. Is the director of the Scripps Translational Science Institute himself committing fraud? I have no reason to think so. What I do think is that he wants to have it both ways. When Theranos was riding high, he hyped it and called it a “threat” (again, that’s a positive adjective in this context). Later, after the house of cards fell, he wrote, “I met Holmes twice and conducted a video interview with her in 2013. . . . Like so many others, I had confirmation bias, wanting this young, ambitious woman with a great idea to succeed. The following year, in an interview with The New Yorker, I expressed my deep concern about the lack of any Theranos transparency or peer-reviewed research.” Actually, though, here’s what he said to the New Yorker: “I tend to believe that Theranos is a threat. But if I saw data in a journal, head to head, I would feel a lot more comfortable.” Sounds to me less like deep concern and more like hedging his bets.

Caught like a deer in the headlights between skepticism and fomo.

Extinct Champagne grapes? I can be even more disappointed in the news media

Happy New Year. This post is by Lizzie.

Over the end-of-year holiday period, I always get the distinct impression that most journalists are on holiday too. I felt this more acutely when I found an “urgent” media request in my inbox when I returned to it after a few days away. Someone at a major reputable news outlet wrote:

We are doing a short story on how the climate crisis is causing certain grapes, used in almost all champagne, to be on the brink of extinction. We were hoping to do a quick interview with you on the topic….Our deadline is asap, as we plan to run this story on New Years.

It was late on 30 December so I had missed helping them, but I still had to reply that I hoped they found some better information, because ‘the climate crisis is causing certain grapes, used in almost all champagne, to be on the brink of extinction’ was not good information in my not-so-entirely-humble opinion, as I study this and can think of zero-zilch-nada evidence to support this.

This sounded like insane news I would expect from more insane media outlets. I tracked down what I assume was the lead they were following (see here), and found it seems to relate to some AI start-up I will not do the service of mentioning that is just looking for more press. They seem to put out splashy sounding agricultural press releases often — and so they must have put out one about Champagne grapes being on the brink of extinction to go with New Year’s.

I am on a bad roll with AI just now, or — more exactly — the intersection of human standards and AI. There’s no good science that “the climate crisis is causing certain grapes, used in almost all champagne, to be on the brink of extinction.” The whole idea of this is offensive to me when human actions are actually driving species extinct. And it ignores tons of science on winegrapes and the reality that they’re pretty easy to grow (growing excellent ones? Harder). So, poor form on the part of the zero-standards-for-our-science AI startup. But I am more horrified by the media outlets that cannot see through this. I am sure they’re inundated with lots of crazy bogus stories every day, but I thought that their job was to report on ones that matter and that they hopefully have some evidence are true.

What did they do instead of that? They gave a platform to a “highly adaptable marketing manager and content creator” to talk about some bogus “study” and a few soundbites to a colleague of mine who actually knew the science (Ben Cook from NASA).

Judgments versus decisions

This is Jessica. A paper called “Decoupling Judgment and Decision Making: A Tale of Two Tails” by Oral, Dragicevic, Telea, and Dimara showed up in my feed the other day. The premise of the paper is that when people interact with some data visualization, their accuracy in making judgments might conflict with their accuracy in making decisions from the visualization. Given that the authors appear to be basing the premise in part on results from a prior paper on decision making from uncertainty visualizations I did with Alex Kale and Matt Kay, I took a look. Here’s the abstract:

Is it true that if citizens understand hurricane probabilities, they will make more rational decisions for evacuation? Finding answers to such questions is not straightforward in the literature because the terms “judgment” and “decision making” are often used interchangeably. This terminology conflation leads to a lack of clarity on whether people make suboptimal decisions because of inaccurate judgments of information conveyed in visualizations or because they use alternative yet currently unknown heuristics. To decouple judgment from decision making, we review relevant concepts from the literature and present two preregistered experiments (N=601) to investigate if the task (judgment vs. decision making), the scenario (sports vs. humanitarian), and the visualization (quantile dotplots, density plots, probability bars) affect accuracy. While experiment 1 was inconclusive, we found evidence for a difference in experiment 2. Contrary to our expectations and previous research, which found decisions less accurate than their direct-equivalent judgments, our results pointed in the opposite direction. Our findings further revealed that decisions were less vulnerable to status-quo bias, suggesting decision makers may disfavor responses associated with inaction. We also found that both scenario and visualization types can influence people’s judgments and decisions. Although effect sizes are not large and results should be interpreted carefully, we conclude that judgments cannot be safely used as proxy tasks for decision making, and discuss implications for visualization research and beyond. Materials and preregistrations are available at https://osf.io/ufzp5/?view only=adc0f78a23804c31bf7fdd9385cb264f. 

There’s a lot being said here, but they seem to be getting at a difference between forming accurate beliefs from some information and making a good (e.g., utility optimal) decision. I would agree there are slightly different processes. But they are also claiming to have a way of directly comparing judgment accuracy to decision accuracy. While I appreciate the attempt to clarify terms that are often overloaded, I’m skeptical that we can meaningfully separate and compare judgments from decisions in an experiment. 

Some background

Let’s start with what we found in our 2020 paper, since Oral et al base some of their questions and their own study setup on it. In that experiment we’d had people make incentivized decisions from displays that varied only how they visualized the decision-relevant probability distributions. Each one showed a distribution of expected scores in a fantasy sports game for a team with and without the addition of a new player. Participants had to decide whether to pay for the new player or not in light of the cost of adding the player, the expected score improvement, and the amount of additional monetary award they won when they scored above a certain number of points. We also elicited a (controversial) probability of superiority judgment: What do you think is the probability your team will score more points with the new player than without? In designing the experiment we held various aspects of the decision problem constant so that only the ground truth probability of superiority was varying between trials. So we talked about the probability judgment as corresponding to the decision task.
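To fix ideas, here is a rough reconstruction of that setup in code (my own sketch with invented numbers, not the actual study materials): the elicited judgment is the probability of superiority, while the incentivized decision compares the expected gain in award money against the player's cost.

import numpy as np

rng = np.random.default_rng(7)
score_without = rng.normal(100, 15, size=100_000)   # simulated team scores without the new player
score_with = rng.normal(104, 15, size=100_000)      # simulated team scores with the new player

prob_superiority = (score_with > score_without).mean()   # the elicited "judgment"

threshold, award, cost = 110, 50.0, 2.0                  # hypothetical payoff structure
expected_gain = award * ((score_with > threshold).mean() - (score_without > threshold).mean())
decision = "pay for the player" if expected_gain > cost else "keep the team as is"

print(round(prob_superiority, 2), round(expected_gain, 2), decision)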

However, after modeling the results we found that depending on whether we analyzed results from the probability response question or the incentivized decision, the ranking of visualizations changed. At the time we didn’t have a good explanation for this disparity between what was helpful for doing the probability judgment versus the decision, other than maybe it was due to the probability judgment not being directly incentivized like the decision response was. But in a follow-up analysis that applied a rational agent analysis framework to this same study, allowing us to separate different sources of performance loss by calibrating the participants’ responses for the probability task, we saw that people were getting most of the decision-relevant information regardless of which question they were responding to; they just struggled to report it for the probability question. So we concluded that the most likely reason for the disparity between judgment and decision results was probably that the probability of superiority judgment was not the most intuitive judgment to be eliciting – if we really wanted to elicit the beliefs directly corresponding to the incentivized decision task, we should have asked them for the difference in the probability of scoring enough points to win the award with and without the new player. But this is still just speculation, since we still wouldn’t be able to say in such a setup how much the results were impacted by only one of the responses being incentivized. 

Oral et al. gloss over this nuance, interpreting our results as finding “decisions less accurate than their direct-equivalent judgments,” and then using this as motivation to argue that “the fact that the best visualization for judgment did not necessarily lead to better decisions reveals the need to decouple these two tasks.” 

Let’s consider for a moment by what means we could try to eliminate ambiguity in comparing probability judgments to the associated decisions. For instance, if only incentivizing one of the two responses confounds things, we might try incentivizing the probability judgment with its own payoff function, and compare the results to the incentivized decision results. Would this allow us to directly study the difference between judgments and decision-making? 

I argue no. For one, we would need to use different scoring rules for the two different types of response, and things might rank differently depending on the rule (not to mention one rule might be easier to optimize under). But on top of this, I would argue that once you provide a scoring rule for the judgment question, it becomes hard to distinguish that response from a decision by any reasonable definition. In other words, you can’t eliminate confounds that could explain a difference between “judgment” and “decision” without turning the judgment into something indistinguishable from a decision. 

What is a decision? 

The paper by Oral et al. describes abundant confusion in the literature about the difference between judgment and decision-making, proposing that “One barrier to studying decision making effectively is that judgments and decisions are terms not well-defined and separated.” They criticize various studies on visualizations for claiming to study decisions when they actually study judgments. Ultimately they describe their view as:

In summary, while decision making shares similarities with judgment, it embodies four distinguishing features: (I) it requires a choice among alternatives, implying a loss of the remaining alternatives, (II) it is future-oriented, (III) it is accompanied with overt or covert actions, and (IV) it carries a personal stake and responsibility for outcomes. The more of these features a judgment has, the more “decision-like” it becomes. When a judgment has all four features, it no longer remains a judgment and becomes a decision. This operationalization offers a fuzzy demarcation between judgment and decision making, in the sense that it does not draw a sharp line between the two concepts, but instead specifies the attributes essential to determine the extent to which a cognitive process is a judgment, a decision, or somewhere in-between [58], [59].

This captures components of other definitions of decision I’ve seen in research related to evaluating interfaces, e.g., a decision as “a choice between alternatives,” typically involving “high stakes.” However, like these other definitions, I don’t think Oral et al.’s definition very clearly differentiates a decision from other forms of judgment.

Take the “personal stake and responsibility for outcomes” part. How do we interpret this given that we are talking about subjects in an experiment, not people making decisions in some more naturalistic context?

In the context of an experiment, we control the stakes and one’s responsibility for their action via a scoring rule. We could instead ask people to imagine making some life-or-death decision in our study and call it high stakes, as many researchers do. But they are in an experiment, and they know it. In the real world people have goals, but in an experiment you have to endow them with goals.

So we should incentivize the question to ensure participants have some sense of the consequences associated with what they decide. We can ask them to separately report their beliefs, e.g., what they perceive some decision-relevant probability to be, as we did in the 2020 study. But if we want to eliminate confounds between the decision and the judgment, we should incentivize the belief question too, ideally with a proper scoring rule so that it’s in participants’ best interest to tell us the truth. Now both our decision task and our judgment task, from the standpoint of the experimental subject, would seem to have some personal stake. So we can’t easily distinguish the decision based on its personal stakes.
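As a quick illustration of why a proper scoring rule does that job, here’s a small sketch (my own toy example, not anything from Oral et al. or our studies) showing that under the quadratic (Brier) rule, reporting your true belief maximizes your expected score:

```python
# Brier (quadratic) score for reporting probability q when the binary outcome is y: -(q - y)^2.
# Its expectation under true belief p is maximized at q = p, so honesty is optimal.
import numpy as np

def expected_brier(q, p):
    # Expected score when the true probability is p and the report is q.
    return p * -(q - 1) ** 2 + (1 - p) * -(q - 0) ** 2

p_true = 0.7  # hypothetical true belief
reports = np.linspace(0, 1, 101)
scores = [expected_brier(q, p_true) for q in reports]
best_report = reports[int(np.argmax(scores))]

print(best_report)  # approximately 0.7: truthful reporting maximizes expected score
```

Once the report is scored this way, the participant is choosing a number to maximize an expected payoff, which is exactly the structure of the decision task.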

Oral et al. might argue that the judgment question is still not a decision, because there are three other criteria for a decision according to their definition. Considering (I), will asking for a person’s belief require them to make a choice between alternatives? Yes, it will, because any format we elicit their response in will naturally constrain it. Even if we just provide a text box to type in a number between 0 and 1, we’re going to get values rounded at some decimal place. So it’s hard to use “a choice among alternatives” as a distinguishing criterion in any actual experiment.

What about (II), being future-oriented? Well, if I’m incentivizing the question then it will be just as future-oriented as my decision is, in that my payoff depends on my response and the ground truth, which is unknown to me until after I respond.

When it comes to (III), overt or covert actions, as in (I), in any actual experiment, my action space will be some form of constrained response space. It’s just that now my action is my choice of which beliefs to report. The action space might be larger, but again there is no qualitative difference between choosing what beliefs to report and choosing what action to report in some more constrained decision problem.

To summarize, by trying to put judgments and decisions on equal footing by scoring both, I’ve created something that seems to meet Oral et al.’s definition of a decision. While I do think there is a difference between a belief and a decision, I don’t think it’s so easy to measure these things without leaving open various other explanations for why the responses differ.

In their paper, Oral et al. sidestep incentivizing participants directly, assuming they will be intrinsically motivated. They report on two experiments that used a task inspired by our 2020 paper (showing visualizations of expected score distributions and asking, Do you want the team with or without the new player, where the participant’s goal is to win a monetary award that requires scoring a certain number of points). Instead of incentivizing the decision via the scoring rule, they told participants to try to be accurate. And instead of eliciting the probabilistic beliefs corresponding to the decision, they asked two questions: Which option (team) is better?, and Which of the teams do you choose? They interpret the first answer as the judgment and the second as the decision.

I can sort of see what they are trying to do here, but this seems like essentially the same task to me. Especially if you assume people are intrinsically motivated to be accurate and plan to evaluate responses using the same scoring rule, as they do. Why would we expect a difference between these two responses? To use a different example that came up in a discussion I was having with Jason Hartline, if you imagine a judge who cares only about doing the right thing (convicting the guilty and acquitting the innocent), who must decide whether to acquit or convict a defendant, why would you expect a difference (in accuracy) when you ask them ‘Is he guilty’ versus ‘Will you acquit or convict?’ 

In their first experiment using this simple wording, Oral et al. find no difference between responses to the two questions. In a second experiment they slightly changed the wording of the questions to emphasize that one was “your judgment” and one was “your decision.” This leads to what they say is suggestive evidence that people’s decisions are more accurate than their judgments. I’m not so sure.

The takeaway

It’s natural to conceive of judgments or beliefs as being distinct from decisions. If we subscribe to a Bayesian formulation of learning from data, we expect the rational person to form beliefs about the state of the world and then choose the utility maximizing action given those beliefs. However, it is not so natural to try to directly compare judgments and decisions on equal footing in an experiment. 

More generally, when it comes to evaluating human decision-making (what we generally want to do in research related to interfaces) we gain little by preferring colloquial verbal definitions over the formalisms of statistical decision theory, which provide tools designed to evaluate people’s decisions ex-ante. It’s much easier to talk about judgment and decision-making when we have a formal way of representing a decision problem (i.e., state space, action space, data-generating model, scoring rule), and a shared understanding of what the normative process of learning from data to make a decision is (i.e., start with prior beliefs, update them given some signal, choose the action that maximizes your expected score under the data-generating model). In this case, we could get some insight into how judgments and decisions can differ simply by considering the process implied by expected utility theory. 
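For concreteness, here’s a minimal sketch of that normative process for a toy two-state, two-action problem; all the numbers (prior, likelihoods, payoffs) are made up for illustration. The posterior is the “judgment,” and the expected-score-maximizing action is the “decision”:

```python
# Toy statistical decision problem: two states, one noisy signal, two actions.
import numpy as np

prior = np.array([0.5, 0.5])            # P(state = 0), P(state = 1)
likelihood = np.array([[0.8, 0.2],      # P(signal = 0 | state 0), P(signal = 1 | state 0)
                       [0.3, 0.7]])     # P(signal = 0 | state 1), P(signal = 1 | state 1)
payoff = np.array([[1.0, -0.5],         # score of action 0 in state 0, state 1
                   [-1.0, 2.0]])        # score of action 1 in state 0, state 1

signal = 1  # the signal we happened to observe

# Judgment: posterior belief over states after seeing the signal.
posterior = prior * likelihood[:, signal]
posterior /= posterior.sum()

# Decision: the action that maximizes expected score under the posterior.
expected_scores = payoff @ posterior
best_action = int(np.argmax(expected_scores))

print(posterior, best_action)
```

Under this formulation, any gap between elicited judgments and observed decisions has to come from somewhere specific: miscalibrated beliefs, a lossy reporting step, or a choice rule that isn’t expected-score maximization.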

John Mandrola’s tips for assessing medical evidence

Ben Recht writes:

I’m a fan of physician-blogger John Mandrola. He had a nice response to your blog, using it as a jumping-off point for a short tutorial on his rather conservative approach to medical evidence assessment.

John is always even-tempered and constructive, and I thought you might enjoy this piece as an “extended blog comment.” I think he does a decent job answering the question at hand, and his approach to medical evidence appraisal is one I more or less endorse.

My post in question was called, How to digest research claims? (1) vitamin D and covid; (2) fish oil and cancer, and I concluded with this bit of helplessness: “I have no idea what to think about any of these papers. The medical literature is so huge that it often seems hopeless to interpret any single article or even subliterature. I don’t know what is currently considered the best way to summarize the state of medical knowledge on any given topic.”

In his response, “Simple Rules to Understand Medical Claims,” Mandrola offers some tips:

The most important priors when it comes to medical claims are simple: most things don’t work. Most simple answers are wrong. Humans are complex. Diseases are complex. Single causes of complex diseases like cancer should be approached with great skepticism.

One of the studies sent to Gelman was a small trial finding that Vitamin D effectively treated COVID-19. The single-center open-label study enrolled 76 patients in early 2020. Even if this were the only study available, the evidence is not strong enough to move our prior beliefs that most simple things (like a Vitamin D tablet) do not work.

The next step is a simple search—which reveals two large randomized controlled trials of Vitamin D treatment for COVID-19, one published in JAMA and the other in the BMJ. Both were null.

You can use the same strategy for evaluating the claim that fish oil supplementation leads to higher rates of prostate cancer.

Start with prior beliefs. How is it possible that one exposure increases the rate of a disease that mostly affects older men? Answer: it’s not very possible. . . .

Now consider the claims linked in Gelman’s email.

– Serum Phospholipid Fatty Acids and Prostate Cancer Risk: Results From the Prostate Cancer Prevention Trial

– Plasma Phospholipid Fatty Acids and Prostate Cancer Risk in the SELECT Trial

While both studies stemmed from randomized trials, neither was a primary analysis. These were association studies using data from the main trial, and therefore we should be cautious in making causal claims.

Now go to Google. This reveals two large randomized controlled trials of fish oil vs placebo therapy.

– The ASCEND trial of n-3 fatty acids in 15k patients with diabetes found “no significant between-group differences in the incidence of fatal or nonfatal cancer either overall or at any particular body site.” And I would add no difference in all-cause death.

– The VITAL trial included cancer as a primary endpoint. More than 25k patients were randomized. The conclusions: “Supplementation with n−3 fatty acids did not result in a lower incidence of major cardiovascular events or cancer than placebo.”

Mandrola concludes:

I am not arguing that every claim is simple. My case is that the evaluation process is slightly less daunting than Professor Gelman seems to infer.

Of course, medical science can be complicated. Content expertise can be important. . . .

But that does not mean we should take the attitude: “I have no idea what to think about these papers.”

I offer five basic rules of thumb that help in understanding medical claims:

1. Hold pessimistic priors

2. Be super-cautious about causal inferences from nonrandom observational comparisons

3. Look for big randomized controlled trials—and focus on their primary analyses

4. Know that stuff that really works is usually obvious (antibiotics for bacterial infection; AEDs to convert VF)

5. Respect uncertainty. Stay humble about most “positive” claims.

This all makes sense, as long as we recognize that randomized controlled trials are themselves nonrandom observational comparisons: the people in the study won’t in general be representative of the population of interest, and there are also issues such as dropout, selection bias, and the realism of treatments, which can be huge in medical trials. Experimentation is great; we just need to avoid the pitfalls of (a) idealizing studies that have randomization (we should avoid the “chain is as strong as its strongest link” fallacy) and (b) disparaging observational data without assessing its quality.

For our discussion here, the most relevant bit of Mandrola’s advice was this from the comment thread:

Why are people going to a Political Scientist for medical advice? That is odd.

I hope Prof Gelman’s answer was based on a recognition that he doesn’t have the context and/or the historical background to properly interpret the studies.

The answer is: Yes, I do recognize my ignorance! Here’s what I wrote in the above-linked post:

I’m not saying that the answers to these medical questions are unknowable, or even that nobody knows the answers. I can well believe there are some people who have a clear sense of what’s going on here. I’m just saying that I have no idea what to think about these papers.

Mandrola’s advice given above seems reasonable to me. But it can be hard for me to apply, in that it assumes background medical knowledge that I don’t have. On the other hand, when it comes to social science, I know a lot. For example, when I saw the claim that women during a certain time of the month were 20 percentage points more likely to vote for Barack Obama, it was immediately clear this was ridiculous, because public opinion just doesn’t change that much. This had nothing to do with randomized trials or observational comparisons or anything like that; the study was just too noisy to learn anything.
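To give a rough sense of what “too noisy” means here, a back-of-the-envelope calculation (with hypothetical group sizes, not the actual study’s) shows how large the standard error of a difference in proportions is in a survey of a few hundred people:

```python
# Rough noise calculation: standard error of a difference in proportions between two groups.
import math

p = 0.5            # proportion supporting a candidate, taken near 0.5 for the widest case
n1, n2 = 150, 150  # hypothetical group sizes

se_diff = math.sqrt(p * (1 - p) / n1 + p * (1 - p) / n2)
print(se_diff)  # about 0.058, i.e. roughly +/- 12 points at two standard errors
```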

In judo, before you learn the cool moves, you first have to learn how to fall. Maybe we should be training researchers the same way: first learn how things can go wrong, and only when you get that lesson down do you learn the fancy stuff.

I want to follow up on a suggestion from a few years ago:

In judo, before you learn the cool moves, you first have to learn how to fall. Maybe we should be training researchers, journalists, and public relations professionals the same way. First learn about Judith Miller and Thomas Friedman, and only when you get that lesson down do you get to learn about Woodward and Bernstein.

Martha in comments modified my idea:

Yes! But I’m not convinced that “First learn about Judith Miller and Thomas Friedman, and only when you get that lesson down do you get to learn about Woodward and Bernstein” or otherwise learning about people is the way to go. What is needed is teaching that involves lots of critiquing (especially by other students), with the teacher providing guidance (e.g., criticize the work or the action, not the person; no name calling; etc.) so students learn to give and accept criticism as a normal part of learning and working.

I responded:

Yes, learning in school involves lots of failure, getting stuck on homeworks, getting the wrong answer on tests, or (in grad school) having your advisor gently tone down some of your wild research ideas. Or, in journalism school, I assume that students get lots of practice in calling people and getting hung up on.

So, yes, students get the experience of failure over and over. But the message we send, I think, is that once you’re a professional it’s just a series of successes.

Another commenter pointed to this inspiring story from psychology researchers Brian Nosek, Jeffrey Spies, and Matt Motyl, who ran an experiment, thought they had an exciting result, but, just to be sure, they tried a replication and found no effect. This is a great example of how to work and explore as a scientist.

Background

Scientific research is all about discovery of the unexpected: to do research, you need to be open to new possibilities, to design experiments to force anomalies, and to learn from them. The sweet spot for any researcher is at Cantor’s corner.

Buuuut . . . researchers are also notorious for being stubborn. In particular, here’s a pattern we see a lot:
– Research team publishes surprising result A based on some “p less than .05” empirical results.
– This publication gets positive attention and the researchers and others in their subfield follow up with open-ended “conceptual replications”: related studies that also attain the “p less than .05” threshold.
– Given the surprising nature of result A, it’s unsurprising that other researchers are skeptical of A. The more theoretically-minded skeptics, or agnostics, demonstrate statistical reasons why these seemingly statistically-significant results can’t be trusted. The more empirically-minded skeptics, or agnostics, run preregistered replication studies, which fail to replicate the original claim.
– At this point, the original researchers do not apply the time-reversal heuristic and conclude that their original study was flawed (forking paths and all that). Instead they double down, insist their original findings are correct, and they come up with lots of little explanations for why the replications aren’t relevant to evaluating their original claims. And they typically just ignore or brush aside the statistical reasons why their original study was too noisy to ever show what they thought they were finding.

I’ve conjectured that one reason scientists often handle criticism in such scientifically-unproductive ways is . . . the peer-review process, which goes like this:

As scientists, we put a lot of effort into writing articles, typically with collaborators: we work hard on each article, try to get everything right, then we submit to a journal.

What happens next? Sometimes the article is rejected outright, but, if not, we’ll get back some review reports which can have some sharp criticisms: What about X? Have you considered Y? Could Z be biasing your results? Did you consider papers U, V, and W?

The next step is to respond to the review reports, and typically this takes the form of, We considered X, and the result remained significant. Or, We added Y to the model, and the result was in the same direction, marginally significant, so the claim still holds. Or, We adjusted for Z and everything changed . . . hmmmm . . . we then also thought about factors P, Q, and R. After including these, as well as Z, our finding still holds. And so on.

The point is: each of the remarks from the reviewers is potentially a sign that our paper is completely wrong, that everything we thought we found is just an artifact of the analysis, that maybe the effect even goes in the opposite direction! But that’s typically not how we take these remarks. Instead, almost invariably, we think of the reviewers’ comments as a set of hoops to jump through: We need to address all the criticisms in order to get the paper published. We think of the reviewers as our opponents, not our allies (except in the case of those reports that only make mild suggestions that don’t threaten our hypotheses).

When I think of the hundreds of papers I’ve published and the, I dunno, thousand or so review reports I’ve had to address in writing revisions, how often have I read a report and said, Hey, I was all wrong? Not very often. Never, maybe?

Where we’re at now

As scientists, we see serious criticism on a regular basis, and we’re trained to deal with it in a certain way: to respond while making minimal, ideally zero, changes to our scientific claims.

That’s what we do for a living; that’s what we’re trained to do. We think of every critical review report as a pain in the ass that we have to deal with, not as a potential sign that we screwed up.

So, given that training, it’s perhaps little surprise that when our work is scrutinized in post-publication review, we have the same attitude: the expectation that the critic is nitpicking, that we don’t have to change our fundamental claims at all, that if necessary we can do a few supplemental analyses and demonstrate the robustness of our findings to those carping critics.

How to get to a better place?

How can this situation be improved? I’m not sure. In some ways, things are getting better: the replication crisis has happened, and students and practitioners are generally aware that high-profile, well-accepted findings often do not replicate. In other ways, though, I fear we’re headed in the wrong direction: students are now expected to publish peer-reviewed papers throughout grad school, so right away they’re getting on the minimal-responses-to-criticism treadmill.

It’s not clear to me how to best teach people how to fall before they learn fancy judo moves in science.

Statistical Practice as Scientific Exploration (my talk on 4 Mar 2024 at the Royal Society conference on the promises and pitfalls of preregistration)

Here’s the conference announcement:

Discussion meeting organised by Dr Tom Hardwicke, Professor Marcus Munafò, Dr Sophia Crüwell, Professor Dorothy Bishop FRS FMedSci, Professor Eric-Jan Wagenmakers.

Serious concerns about research quality have provoked debate across scientific disciplines about the merits of preregistration — publicly declaring study plans before collecting or analysing data. This meeting will initiate an interdisciplinary dialogue exploring the epistemological and pragmatic dimensions of preregistration, identifying potential limits of application, and developing a practical agenda to guide future research and optimise implementation.

And here’s the title and abstract of my talk, which is scheduled for 14h10 on Mon 4 Mar 2024:

Statistical Practice as Scientific Exploration

Much has been written on the philosophy of statistics: How can noisy data, mediated by probabilistic models, inform our understanding of the world? Researchers when using and developing statistical methods can be seen to be acting as scientists, forming, evaluating, and elaborating provisional theories about the data and processes they are modelling. This perspective has the conceptual value of pointing toward ways that statistical theory can be expanded to incorporate aspects of workflow that were formerly tacit or informal aspects of good practice, and the practical value of motivating tools for improved statistical workflow.

I won’t really be talking about preregistration, in part because I’ve already said so much on that topic here on this blog; see for example here and various links at that post. Instead I’ll be talking about the statistical workflow, which is typically presented as a set of procedures applied to data but which I think is more like a process of scientific exploration and discovery. I addressed some of these ideas in this talk from a couple years ago. But, don’t worry, I’m sure I’ll have lots of new material. Not to mention all the other speakers at the conference.

Explainable AI works, but only when we don’t need it

This is Jessica. I went to NeurIPS last week, mostly to see what it was like. While waiting for my flight home at the airport I caught a talk that did a nice job of articulating some fundamental limitations with attempts to make deep machine learning models “interpretable” or “explainable.”

It was part of an XAI workshop. My intentions in checking out the XAI workshop were not entirely pure, as it’s an area I’ve been skeptical of for a while. Formalizing aspects of statistical communication is very much in line with my interests, but I tried and failed to get into XAI and related work on interpretability a few years ago when it was getting popular. The ML contributions have always struck me as more of an academic exercise than a real attempt at aligning human expectations with model capabilities. When human-computer interaction people started looking into it, there started to be a little more attention to how people actually use explanations, but the methods used to study human reliance on explanations have not been well grounded (e.g., ‘appropriate reliance’ is often defined as agreeing with the AI when it’s right and not agreeing when it’s wrong, which can be shown to be incoherent in various ways).

The talk, by Ulrike Luxburg, which gave a sort of impossibility result for explainable AI, was refreshing. First, she distinguished two very different scenarios for explanation: cooperative ones, where the principal whose model furnishes the explanations and the user consuming them both want the most accurate explanations, versus adversarial scenarios, where the principal’s interests are not aligned with the goal of accurate explanation. For example, a company that needs to explain why it denied someone a loan has little motivation to explain the actual reason behind that prediction, because it’s not in its best interest to give people fodder to minimally change their features and push the prediction to a different label. Her first point was that there is little value in trying to guarantee good explanations in the adversarial case, because existing explanation techniques (e.g., feature-attribution methods like SHAP or LIME) give very different explanations for the same prediction, and the same explanation technique is often highly sensitive to small differences in the function to be explained (e.g., slight changes to parameters in training). There are so many degrees of freedom in selecting among inductive biases that the principal can easily produce something faithful by some definition while hiding important information. Hence laws guaranteeing a right to explanation miss this point.

In the cooperative setting, maybe there is hope. But it turns out that something like the anthropic principle of statistics operates here: we have techniques that we can show work well in simple scenarios where we don’t really need explanations, but when we do really need them (e.g., deep neural nets over high-dimensional feature spaces), anything we can guarantee is not going to be of much use.

There’s an analogy to clustering: back when unsupervised learning was very hot, everyone wanted guarantees for clustering algorithms, but providing them required working in settings where the assumptions were so strong that the clusters would be obvious upon inspecting the data. In explainable AI, we have various feature-attribution methods that describe which features led to the prediction on a particular instance. SHAP, which borrows Shapley values from game theory to allocate credit among features, is very popular. Typically SHAP provides the marginal contribution of each feature, but Shapley Interaction Values have been proposed to allow for local interaction effects between pairs of features. Luxburg presented a theoretical result from this paper, which extends Shapley Interaction Values to n-Shapley Values: explanations of individual predictions with interaction terms up to order n, given some total number of features d. They are additive in that the terms over all variable subsets of size at most n always sum to the output of the function we’re trying to explain. Starting from the original Shapley values (where n=1), n-Shapley Values successively add higher-order variable interactions to the explanations.

The theoretical result shows that n-Shapley Values recover generalized additive models (GAMs), which are GLMs where the transformed expected outcome is a sum of smooth functions of the individual inputs: g(E[Y]) = β_0 + f_1(x_1) + f_2(x_2) + … + f_m(x_m). GAMs are considered inherently interpretable, but are also underdetermined. For n-Shapley to recover a faithful representation of the function as a GAM, the order of the explanation just needs to be at least as large as the maximum order of variable interaction in the model.
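To see the flavor of that recovery result, here’s a brute-force sketch (my own toy example, not code from the talk or the paper) that computes exact Shapley values for a small additive function, marginalizing absent features over a background sample; each feature’s order-1 Shapley value comes out as that feature’s own component, centered by its background average:

```python
# Exact Shapley values by brute-force subset enumeration, on a toy additive model.
from itertools import combinations
from math import factorial
import numpy as np

rng = np.random.default_rng(1)
d = 3
background = rng.normal(size=(1000, d))  # reference data used to marginalize out absent features

def f(X):
    # A toy additive (GAM-like) model: no interactions between features.
    return np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.5 * X[:, 2]

def value(S, x):
    # v(S): expected model output when features in S are fixed at x and the rest
    # are drawn from the background data (assuming independence).
    X = background.copy()
    X[:, list(S)] = x[list(S)]
    return f(X).mean()

def shapley(x):
    phi = np.zeros(d)
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for k in range(d):
            for S in combinations(others, k):
                w = factorial(len(S)) * factorial(d - len(S) - 1) / factorial(d)
                phi[i] += w * (value(S + (i,), x) - value(S, x))
    return phi

x = np.array([0.3, -1.2, 2.0])
print(shapley(x))
# Compare with each additive component, centered by its background mean:
print(np.sin(x[0]) - np.sin(background[:, 0]).mean(),
      x[1] ** 2 - (background[:, 1] ** 2).mean(),
      0.5 * x[2] - 0.5 * background[:, 2].mean())
```

With interactions added to f, the order-1 values would instead fold those interactions into per-feature averages, which is where the interpretability trouble described next begins.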

However, GAMs lose their interpretability as we add interactions. When we have large numbers of features, as is typically the case in deep learning, what is the value of the explanation? We would need to look at interactions among all combinatorial subsets of the features. So when simple explanations like standard SHAP are applied to complex functions, each value is an average over an astronomical number of interaction terms, and there’s no reduction that would give you something meaningful. The fact that we can prove SHAP does the right thing in the simple setting of an order-1 GAM does not mean we’re anywhere close to having “solved” explainability.

The organizers of the workshop obviously invited this rather negative talk on XAI, so perhaps the community is undergoing self-reflection that will temper the overconfidence I associate with it. Although, the day before the workshop I also heard someone complaining that his paper on calibration got rejected from the same workshop, with an accompanying explanation that it wasn’t about LIME or SHAP. Something tells me XAI will live on.

I guess one could take a pragmatic view: if explanations of model predictions, however meaningless, lead to better human decisions in scenarios where humans must make the final call regardless of model accuracy (e.g., medical diagnoses, loan decisions, child welfare cases), then there’s still some value in XAI. But who would want to dock their research program on such shaky footing? And of course we still need an adequate way of measuring reliance, but I will save my thoughts on that for another post.

Another thing that struck me about the talk was a kind of tension around trusting one’s instinct that something is half-baked versus taking the time to get to the bottom of it. Luxburg started by talking about how her strong gut feeling as a theorist was that trying to guarantee AI explainability was not going to be possible. I believed her before she ever got into the demonstration, because it matched my intuition. But then she spent the next 30 minutes discussing an XAI paper. There’s a decision to be made sometimes about whether to trust your intuition and move on to something you might still believe in, versus stopping to articulate the critique. Others might benefit from the latter, but then you realize you’ve just spent another year trying to point out issues with a line of work you stopped believing in a long time ago. Anyway, I can relate to that. (Not that I’m complaining about the paper she presented – I’m glad she took the time to figure it out, as it provides a nice example.)

I was also reminded of the kind of awkward moment that happens sometimes where someone says something rather final and damning, and everyone pauses for a moment to listen to it. Then the chatter starts right back up again like it was never said. Gotta love academics!