What’s up with spring blooming?

 

This post is by Lizzie.

Here’s another media hit I missed; I was asked to discuss why daffodils are blooming now in January. If I could have replied I would have said something like:

(1) Vancouver is a weird mix of cool and mild for a temperate place — so we think plants accumulate their chilling (cool-ish winter temperatures needed before plants can respond to warm temperatures, but just cool — like 2-6 C is a supposed sweet spot) quickly and then a warm snap means they get that warmth they need and they start growing.

This is especially true for plants from other places that likely are not evolved for Vancouver’s climate, like daffodils.

(2) It’s been pretty warm! I bet they flowered because it has been so warm.

Deep insights, I know …. They missed me, but luckily they got my colleague Doug Justice to speak and he hit my points. Doug knows plants better than I do. He also calls our cherry timing for our …

International Cherry Prediction Competition

Which is happening again this year!

You should compete! Why? You can win money, and you can help us build better models, because here’s what I would not say on TV:

We all talk about ‘chilling’ and ‘forcing’ in plants, and what we don’t tell you is that we never actually measure the physiological transition between chilling and forcing because… we aren’t sure what it is! Almost all chilling-forcing models are built on scant data where some peaches (mostly) did not bloom when they were planted in warm places 50+ years ago. We need your help!
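
If you want to see the basic logic in one place, here is a minimal sketch of a chill-then-force threshold model, purely my own toy illustration with invented thresholds rather than anything we actually fit:

```python
import random

def predict_bloom_day(hourly_temps_c, chill_req_hours=600,
                      forcing_req_gdh=2400, chill_range=(2.0, 6.0),
                      base_temp=5.0):
    """Toy chill-then-force model: return the predicted bloom day, or None.

    Chilling accumulates only while the temperature sits in the cool band;
    once the chill requirement is met, warmth above a base temperature
    accumulates as growing degree hours (forcing) until bloom.
    All thresholds are invented for illustration, not fitted values.
    """
    chill_hours = 0.0
    forcing_gdh = 0.0
    for hour, temp in enumerate(hourly_temps_c):
        if chill_hours < chill_req_hours:
            if chill_range[0] <= temp <= chill_range[1]:
                chill_hours += 1.0
        else:
            forcing_gdh += max(0.0, temp - base_temp)
            if forcing_gdh >= forcing_req_gdh:
                return hour // 24
    return None

# Invented example: two cool, mild months followed by a warm spell
random.seed(0)
hourly = ([random.gauss(4, 3) for _ in range(24 * 60)]
          + [random.gauss(12, 3) for _ in range(24 * 30)])
print(predict_bloom_day(hourly))
```

In a mild place like Vancouver the chill requirement can be satisfied early in the winter, so a single warm spell can push the forcing sum past its threshold in January, which is roughly the daffodil story above. And note what the sketch does not contain: any measured physiological transition between the two phases. That is the part we need help modeling.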

 

The immediate victims of the con would rather act as if the con never happened. Instead, they’re mad at the outsiders who showed them that they were being fooled.

Dorothy Bishop has the story about “a chemistry lab in CNRS-Université Sorbonne Paris Nord”:

More than 20 scientific articles from the lab of one principal investigator have been shown to contain recycled and doctored graphs and electron microscopy images. That is, results from different experiments that should have distinctive results are illustrated by identical figures, with changes made to the axis legends by copying and pasting numbers on top of previous numbers. . . . the problematic data are well-documented in a number of PubPeer comments on the articles (see links in Appendix 1 of this document).

The response by CNRS [Centre National de la Recherche Scientifique] to this case . . . was to request correction rather than retraction of what were described as “shortcomings and errors”, to accept the scientist’s account that there was no intentionality, despite clear evidence of a remarkable amount of manipulation and reuse of figures; a disciplinary sanction of exclusion from duties was imposed for just one month.

I’m not surprised. The sorts of people who will cheat on their research are likely to be the same sorts of people who will instigate lawsuits, start media campaigns, and attack in other ways. These are researchers who’ve already shown a lack of scruple and a willingness to risk their careers; in short, they’re loose cannons, scary people, so it can seem like the safest strategy to not try to upset them too much, not trap them into a corner where they’ll fight like trapped rats. I’m not speaking specifically of this CNRS researcher—I know nothing of the facts of this case beyond what’s reported in Bishop’s post—I’m just speaking to the mindset of the academic administrators who would just like the problem to go away so they can get on with their regular jobs.

But Bishop and her colleagues were annoyed. If even blatant examples of scientific misconduct cannot be handled straightforwardly, what does this say about the academic and scientific process more generally? Is science just a form of social media, where people can make any sort of claim and evidence doesn’t matter?

They write:

So what should happen when fraud is suspected? We propose that there should be a prompt investigation, with all results transparently reported. Where there are serious errors in the scientific record, then the research articles should immediately be retracted, any research funding used for fraudulent research should be returned to the funder, and the person responsible for the fraud should not be allowed to run a research lab or supervise students. The whistleblower should be protected from repercussions.

In practice, this seldom happens. Instead, we typically see, as in this case, prolonged and secret investigations by institutions, journals and/or funders. There is a strong bias to minimize the severity of malpractice, and to recommend that published work be “corrected” rather than retracted.

Bishop and her colleagues continue:

One can see why this happens. First, all of those concerned are reluctant to believe that researchers are dishonest, and are more willing to assume that the concerns have been exaggerated. It is easy to dismiss whistleblowers as deluded, overzealous or jealous of another’s success. Second, there are concerns about reputational risk to an institution if accounts of fraudulent research are publicised. And third, there is a genuine risk of litigation from those who are accused of data manipulation. So in practice, research misconduct tends to be played down.

But:

This failure to act effectively has serious consequences:

1. It gives credibility to fictitious results, slowing down the progress of science by encouraging others to pursue false leads. . . . [and] erroneous data pollutes the databases on which we depend.

2. Where the research has potential for clinical or commercial application, there can be direct damage to patients or businesses.

3. It allows those who are prepared to cheat to compete with other scientists to gain positions of influence, and so perpetuate further misconduct, while damaging the prospects of honest scientists who obtain less striking results.

4. It is particularly destructive when data manipulation involves the Principal Investigator of a lab. . . . CNRS has a mission to support research training: it is hard to see how this can be achieved if trainees are placed in a lab where misconduct occurs.

5. It wastes public money from research grants.

6. It damages public trust in science and trust between scientists.

7. It damages the reputation of the institutions, funders, journals and publishers associated with the fraudulent work.

8. Whistleblowers, who should be praised by their institution for doing the right thing, are often made to feel that they are somehow letting the side down by drawing attention to something unpleasant. . . .

What happened next?

It’s the usual bad stuff. They received a series of stuffy bureaucratic responses, none of which addressed any of items 1 through 8 above, let alone the problem of the data, which appear to have been obviously faked. Just disgusting.

But I’m not surprised. We’ve seen it many times before:

– The University of California’s unresponsive response when informed of research misconduct by their star sleep expert.

– The American Political Science Association refusing to retract an award given to an author for a book with plagiarized material, or even to retroactively have the award shared with the people whose material was copied without acknowledgment.

– The London Times never acknowledging the blatant and repeated plagiarism by its celebrity chess columnist.

– The American Statistical Association refusing to retract an award given to a professor who plagiarized multiple times, including from wikipedia (in an amusing case where he created negative value by introducing an error into the material he’d copied, so damn lazy that he couldn’t even be bothered to proofread his pasted material).

– Cornell University . . . ok they finally canned the pizzagate dude, but only after emitting some platitudes. Kind of amazing that they actually moved on that one.

– The Association for Psychological Science: this one’s personal for me, as they ran an article that flat-out lied about me and then refused to correct it just because, hey, they didn’t want to.

– Lots and lots of examples of people finding errors or fraud in published papers and journals refusing to run retractions or corrections or even to publish letters pointing out what went wrong.

Anyway, this is one more story.

What gets my goat

What really annoys me in these situations is how the institutions show loyalty to the people who did research misconduct. When researcher X works at or publishes with institution Y, and it turns out that X did something wrong, why does Y so often try to bury the problem and attack the messenger? Y should be mad at X; after all, it’s X who has leveraged the reputation of Y for his personal gain. I’d think that the leaders of Y would be really angry at X, even angrier than people from the outside. But it doesn’t happen that way. The immediate victims of the con would rather act as if the con never happened. Instead, they’re mad at the outsiders who showed them that they were being fooled. I’m sure that Dan Davies would have something to say about all this.

In some cases academic misconduct doesn’t deserve a public apology

This is Jessica. As many of you probably saw, Claudine Gay resigned as president of Harvard this week. Her tenure as president is apparently the shortest on record, and accusations of plagiarism involving some of her published papers and her dissertation seem to have been a major contributor to this decision, coming after the initial backlash against the response that Gay, alongside MIT and Penn presidents Kornbluth and Magill, gave to questions from Republican congresswoman Stefanik about blatantly anti-semitic remarks on their campuses in the wake of Oct. 7.

The plagiarism counts are embarrassing for Gay and for Harvard, for sure, as were the very legalistic reactions of all three presidents when asked about anti-semitism on their campuses. In terms of plagiarism as a phenomenon that crops up in academia, I agree with Andrew that it tells us something about the author’s lack of ability or effort to take the time to understand the material. I suspect it happens a lot under the radar, and I see it as a professor (more often now with ChatGPT in the mix, and no, it does not always lead to explicit punishment, to comment on what some are saying online about double standards for faculty and students). What I don’t understand is how in Gay’s case this is misconduct at a level that warrants a number of headline stories in major mainstream news media and the resignation of an administrator who has put aside her research career anyway.

On the one hand, I can see how it is temptingly easy to rationalize why the president of what is probably the most revered university on earth cannot be associated with any academic misconduct without somehow bringing shame on the institution. “She’s the president of Harvard, how can it not be shocking?!” is one narrative, I suppose. But this kind of response to the situation is exactly what bothers me in the wake of her resignation. I will try to explain.

Regarding the specifics, I saw a few of the plagiarized passages early on, and I didn’t see much reason to invest my time in digging further, if this was the best that could be produced by those who were obviously upset about it (I agree with Phil here that they seem like a “weak” form of plagiarism). What makes me uncomfortable about this situation is how so many people, under the guise of being “objective,” did feel the need to invest their time in the name of establishing some kind of truth in the situation. This is the moment of decision that I wish to call attention to. It’s as though in the name of being “neutral” and “evidence based” we are absolved from having to consider why we feel so compelled in certain cases to get to the bottom of it, but not so much in other cases.

It’s the same thing that makes so much research bad: the inability to break frame, to turn on the premise rather than the minor details. To ask, how did we get here? Why are we all taking for granted that this is the thing to be concerned with? 

Situations like what happened to Gay bring a strong sense of deja vu for me. I’m not sure how much my personal reaction is related to being female in a still largely male-dominated field myself, but I suspect it contributes. There’s a scenario that plays out from time to time where someone who is not in the majority in some academic enterprise is found to have messed up. At first glance, it seems fairly minor, somewhat relatable at least, no worse than what many others have done. But, somehow, in some cases it can’t be forgotten. Everyone suddenly exerts effort they would normally have trouble producing for a situation that doesn’t concern them that much personally, poring over the details with a fine-tooth comb to establish that there really was some fatal flaw here. The discussion goes on and becomes hard to shut out, because there is always someone else who is somehow personally offended by it. And the more it gets discussed, the more it seems like a real thing to be dealt with, to be decided. It becomes an example for the sake of being principled. Once this palpable sense that ‘this is important,’ ‘this is a message about our principles,’ sets in, then the details cannot be overlooked. How else can we be sure we are being rational and objective? We have to treat it like evidence and bring to bear everything we know about scrutinizing evidence.

What is hard for me to get over is that these stories that stick around and capture so much attention are far more often stories about some member of the racial or gender non-majority who ended up in a high place. It’s as if resentment that a person from the outside has gotten in sets in without the resenter even becoming aware of it, and suddenly a situation that seems like it should have been cooperative gets much more complicated. This is not to say that people who are in the majority in a field don’t get called out or targeted sometimes; they do. Just that there’s a certain dynamic that seems to set in more readily when someone perceived as not belonging to begin with messes up. As Jamelle Watson-Daniels writes on X/Twitter of the Gay situation: “the legacy and tradition of orchestrated attacks against the credibility of Black scholars all in the name of haunting down and exposing them as… the ultimate imposters.” This is the undertone I’m talking about here.

I’ve been a professor for about 10 years, and in that time I’ve repeatedly seen this sort of hyper-attention turned on women and/or others in the non-majority who violated some minor code. In many instances, it creates a situation that divides those who are confused by the level of detail-orientedness given the crime from those who can’t see how there is any other way than to make the incident into an example. Gay is just the most recent reminder.

What makes this challenging for me to write about personally is that I am a big believer in public critique, and admitting one’s mistakes. I have advocated for both on this blog. To take an example that comes up from time to time, I don’t think that because of uneven power dynamics, public critique of papers with lead student authors should be shut down, or that we owe authors extensive private communications before we openly criticize. That goes against the sort of open discussion of research flaws that we are already often incentivized to avoid. For the same reason, I don’t think that critiques made by people with ulterior motives should be dismissed. I think there were undoubtedly ulterior motives here, and I am not arguing that the information about accounts of plagiarism here should not have been shared at all. 

I also think making decisions driven by social values (which often comes up under the guise of DEI) is very complex. At least in academic computer science, we seem to be experiencing a moment of heightened sensitivity to what is perceived as “moral” and “ethical”, where these things are often defined very simplistically and tolerance for disagreement is low.

And I also think that there are situations where a transgression may seem minor but it is valuable to mind all the details and use it as an example! I was surprised for example at how little interest there seemed to be in the recent Nature Human Behavior paper which claimed to present all confirmatory analyses but couldn’t produce the evidence that the premise of the paper suggests should be readily available. This seemed to me like an important teachable moment given what the paper was advocating to begin with.  

So anyway, lots of reasons why this is hard to write about, and lots of fodder for calling me a hypocrite if you want. But I’m writing this post because the plagiarism is clearly not the complete story here. I don’t know the full details of the Gay investigation (and admit I haven’t spent too much time researching this: I’ve seen a bunch of the plagiarism examples, but I don’t have a lot of context on her entire career). So it’s possible I’m wrong and she did some things that were truly more awful than the average Harvard president. But I haven’t heard about them yet. And either way my point still stands: there are situations with similar dynamics to this where my dedication to scientific integrity and public critique and getting to the bottom of technical details do not disappear, but are put on the back burner to question a bigger power dynamic that seems off.

And so, while I normally think everyone caught doing academic misconduct should acknowledge it, for the reasons above, at least at the moment, it doesn’t bother me that Gay’s resignation letter doesn’t mention the plagiarism. I think not acknowledging it was the right thing to do.

Progress in 2023

Published:

Unpublished:

Enjoy.

Clarke’s Law, and who’s to blame for bad science reporting

Lizzie blamed the news media for a horrible bit of news reporting on the ridiculous claim that “the climate crisis is causing certain grapes, used in almost all champagne, to be on the brink of extinction.” The press got conned by a press release from a sleazy company, which in this case was “a Silicon Valley startup” but in other settings could be a pollster or a car company or a university public relations office or an advocacy group or some other institution that has a quasi-official role in our society.

Lizzie was rightly ticked off by the media organizations that were happily playing the “sucker” role in this drama, with CNN straight-up going with the press release, along with a fawning treatment of the company that was pushing the story, and NPR going with a mildly skeptical amused tone, interviewing an actual outside expert but still making the mistake of taking the story seriously rather than framing it as a marketing exercise.

We’ve seen this sort of credulous reporting before, perhaps most notably with Theranos and the hyperloop. It’s not just that the news media are suckers, it’s that being a sucker—being credulous—is in many cases a positive for a journalist. A skeptical reporter will run fewer stories, right? Malcolm Gladwell and the Freakonomics team are superstars, in part because they’re willing to routinely turn off whatever b.s. detectors they might have, in order to tell good stories. They get rewarded for their practice of promoting unfounded claims. If we were to imagine an agent-based model of the news media, these are the agents that flow to the top. One could suppose a different model, in which mistakes tank your reputation, but that doesn’t seem to be the world in which we operate.
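
To make the agent-based-model aside concrete, here is a toy simulation, entirely my own invention with arbitrary parameters, contrasting a media ecosystem where credulity is costless with one where getting burned costs reputation:

```python
import random

random.seed(1)

def simulate(penalty, n_journalists=200, n_rounds=500):
    """Toy model: each journalist has a credulity level in [0, 1].
    Each round they see a pitch; the more credulous they are, the more
    likely they are to run it. Every story run adds visibility; a story
    turns out bogus with probability 0.3, which costs `penalty` reputation.
    Returns the average credulity of the top 10% by reputation."""
    cred = [random.random() for _ in range(n_journalists)]
    rep = [0.0] * n_journalists
    for _ in range(n_rounds):
        for j in range(n_journalists):
            if random.random() < cred[j]:      # credulous enough to run it
                rep[j] += 1.0                  # stories build visibility
                if random.random() < 0.3:      # story turns out to be bogus
                    rep[j] -= penalty
    top = sorted(range(n_journalists), key=lambda j: -rep[j])[:n_journalists // 10]
    return sum(cred[j] for j in top) / len(top)

print("no cost for mistakes:    ", round(simulate(penalty=0.0), 2))  # most credulous rise
print("mistakes tank reputation:", round(simulate(penalty=5.0), 2))  # skeptics rise
```

Nothing deep here, but it makes the point: which agents flow to the top depends entirely on whether mistakes carry any cost, and in the world we seem to operate in they mostly don't.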

So, yeah, let’s get mad at the media, first for this bogus champagne story and second for using this as an excuse to promote a bogus company.

Also . . .

Let’s get mad at the institutions of academic science, which for years have been unapologetically promoting crap like himmicanes, air rage, ages ending in 9, nudges, and, let’s never forget, the lucky golf ball.

In terms of wasting money and resources, I don’t think any of those are as consequential as business scams such as Theranos or hyperloop; rather, they bother me because they’re coming from academic science, which might traditionally be considered a more trustworthy source.

And this brings us to Clarke’s law, which you may recall is the principle that any sufficiently crappy research is indistinguishable from fraud.

How does that apply here? I can only assume that the researchers behind the studies of himmicanes, air rage, ages ending in 9, nudges, the lucky golf ball, and all the rest, are sincere and really believe that their claims are supported by their data. But there have been lots of failed replications, along with methodological and statistical explanations of what went wrong in those studies. At some point, to continue to promote them is, in my opinion, on the border of fraud: it requires willfully looking away from contrary evidence and, at the extreme, leads to puffed-up-rooster claims such as, “The replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.”

In short, the corruption involved in the promotion of academic science has poisoned the well and facilitated the continuing corruption of the news media by business hype.

I’m not saying that business hype and media failure are the fault of academic scientists. Companies would be promoting themselves, and these lazy news organizations would be running glorified press releases, no matter what we were to do in academia. Nor, for that matter, are academics responsible for credulity on stories such as UFO space aliens. The elite news media seems to be able to do this all on its own.

I just don’t think that academic science hype is helping with the situation. Academic science hype helps to set up the credulous atmosphere.

Michael Joyner made a similar point a few years ago:

Why was the Theranos pitch so believable in the first place? . . .

Who can forget when James Watson. . . . co-discoverer of the DNA double helix, made a prediction in 1998 to the New York Times that so-called VEGF inhibitors would cure cancer in “two years”?

At the announcement of the White House Human Genome Project in June 2000, both President Bill Clinton and biotechnologist Craig Venter predicted that cancer would be vanquished in a generation or two. . . .

That was followed in 2005 by the head of the National Cancer Institute, Andrew von Eschenbach, predicting the end of “suffering and death” from cancer by 2015, based on a buzzword bingo combination of genomics, informatics, and targeted therapy.

Verily, the life sciences arm of Google, generated a promotional video that has, shall we say, some interesting parallels to the 2014 TedMed talk given by Elizabeth Holmes. And just a few days ago, a report in the New York Times on the continuing medical records mess in the U.S. suggested that with better data mining of more coherent medical records, new “cures” for cancer would emerge. . . .

So, why was the story of Theranos so believable in the first place? In addition to the specific mix of greed, bad corporate governance, and too much “next” Steve Jobs, Theranos thrived in a biomedical innovation world that has become prisoner to a seemingly endless supply of hype.

Joyner also noted that science hype was following patterns of tech hype. For example, this from Dr. Eric Topol, director of the Scripps Translational Science Institute:

When Theranos tells the story about what the technology is, that will be a welcome thing in the medical community. . . . I tend to believe that Theranos is a threat.

The Scripps Translational Science Institute is an academic, or at least quasi-academic, institution! But they’re using tech hype disrupter terminology by calling scam company Theranos a “threat” to the existing order. Is the director of the Scripps Translational Science Institute himself committing fraud? I have no reason to think so. What I do think is that he wants to have it both ways. When Theranos was riding high, he hyped it and called it a “threat” (again, that’s a positive adjective in this context). Later, after the house of cards fell, he wrote, “I met Holmes twice and conducted a video interview with her in 2013. . . . Like so many others, I had confirmation bias, wanting this young, ambitious woman with a great idea to succeed. The following year, in an interview with The New Yorker, I expressed my deep concern about the lack of any Theranos transparency or peer-reviewed research.” Actually, though, here’s what he said to the New Yorker: “I tend to believe that Theranos is a threat. But if I saw data in a journal, head to head, I would feel a lot more comfortable.” Sounds to me less like deep concern and more like hedging his bets.

Caught like a deer in the headlights between skepticism and fomo.

Extinct Champagne grapes? I can be even more disappointed in the news media

Happy New Year. This post is by Lizzie.

Over the end-of-year holiday period, I always get the distinct impression that most journalists are on holiday too. I felt this more acutely when I found an “urgent” media request in my inbox when I returned to it after a few days away. Someone at a major reputable news outlet wrote:

We are doing a short story on how the climate crisis is causing certain grapes, used in almost all champagne, to be on the brink of extinction. We were hoping to do a quick interview with you on the topic….Our deadline is asap, as we plan to run this story on New Years.

It was late on 30 December so I had missed helping them, but I still had to reply that I hoped they found some better information, because ‘the climate crisis is causing certain grapes, used in almost all champagne, to be on the brink of extinction’ was not good information in my not-so-entirely-humble opinion, as I study this and can think of zero-zilch-nada evidence to support it.

This sounded like insane news I would expect from more insane media outlets. I tracked down what I assume was the lead they were following (see here), and found that it seems to relate to some AI start-up I will not do the service of mentioning that is just looking for more press. They seem to put out splashy-sounding agricultural press releases often — and so they must have put out one about Champagne grapes being on the brink of extinction to go with New Year’s.

I am on a bad roll with AI just now, or — more exactly — with the intersection of human standards and AI. There’s no good science that “the climate crisis is causing certain grapes, used in almost all champagne, to be on the brink of extinction.” The whole idea of this is offensive to me when human actions are actually driving species extinct. And it ignores tons of science on winegrapes and the reality that they’re pretty easy to grow (growing excellent ones? Harder). So, poor form on the part of the zero-standards-for-our-science AI startup. But I am more horrified by the media outlets that cannot see through this. I am sure they’re inundated with lots of crazy bogus stories every day, but I thought their job was to report on the ones that matter and that they hopefully have some evidence are true.

What did they do instead of that? They gave a platform to “a highly adaptable marketing manager and content creator” to talk about some bogus “study” and a few soundbites to a colleague of mine who actually knew the science (Ben Cook from NASA).

Judgments versus decisions

This is Jessica. A paper called “Decoupling Judgment and Decision Making: A Tale of Two Tails” by Oral, Dragicevic, Telea, and Dimara showed up in my feed the other day. The premise of the paper is that when people interact with some data visualization, their accuracy in making judgments might conflict with their accuracy in making decisions from the visualization. Given that the authors appear to be basing the premise in part on results from a prior paper on decision making from uncertainty visualizations I did with Alex Kale and Matt Kay, I took a look. Here’s the abstract:

Is it true that if citizens understand hurricane probabilities, they will make more rational decisions for evacuation? Finding answers to such questions is not straightforward in the literature because the terms “judgment” and “decision making” are often used interchangeably. This terminology conflation leads to a lack of clarity on whether people make suboptimal decisions because of inaccurate judgments of information conveyed in visualizations or because they use alternative yet currently unknown heuristics. To decouple judgment from decision making, we review relevant concepts from the literature and present two preregistered experiments (N=601) to investigate if the task (judgment vs. decision making), the scenario (sports vs. humanitarian), and the visualization (quantile dotplots, density plots, probability bars) affect accuracy. While experiment 1 was inconclusive, we found evidence for a difference in experiment 2. Contrary to our expectations and previous research, which found decisions less accurate than their direct-equivalent judgments, our results pointed in the opposite direction. Our findings further revealed that decisions were less vulnerable to status-quo bias, suggesting decision makers may disfavor responses associated with inaction. We also found that both scenario and visualization types can influence people’s judgments and decisions. Although effect sizes are not large and results should be interpreted carefully, we conclude that judgments cannot be safely used as proxy tasks for decision making, and discuss implications for visualization research and beyond. Materials and preregistrations are available at https://osf.io/ufzp5/?view only=adc0f78a23804c31bf7fdd9385cb264f. 

There’s a lot being said here, but they seem to be getting at a difference between forming accurate beliefs from some information and making a good (e.g., utility optimal) decision. I would agree there are slightly different processes. But they are also claiming to have a way of directly comparing judgment accuracy to decision accuracy. While I appreciate the attempt to clarify terms that are often overloaded, I’m skeptical that we can meaningfully separate and compare judgments from decisions in an experiment. 

Some background

Let’s start with what we found in our 2020 paper, since Oral et al base some of their questions and their own study setup on it. In that experiment we’d had people make incentivized decisions from displays that varied only how they visualized the decision-relevant probability distributions. Each one showed a distribution of expected scores in a fantasy sports game for a team with and without the addition of a new player. Participants had to decide whether to pay for the new player or not in light of the cost of adding the player, the expected score improvement, and the amount of additional monetary award they won when they scored above a certain number of points. We also elicited a (controversial) probability of superiority judgment: What do you think is the probability your team will score more points with the new player than without? In designing the experiment we held various aspects of the decision problem constant so that only the ground truth probability of superiority was varying between trials. So we talked about the probability judgment as corresponding to the decision task.

However, after modeling the results we found that depending on whether we analyzed results from the probability response question or the incentivized decision, the ranking of visualizations changed. At the time we didn’t have a good explanation for this disparity between what was helpful for doing the probability judgment versus the decision, other than maybe it was due to the probability judgment not being directly incentivized like the decision response was. But in a follow-up analysis that applied a rational agent analysis framework to this same study, allowing us to separate different sources of performance loss by calibrating the participants’ responses for the probability task, we saw that people were getting most of the decision-relevant information regardless of which question they were responding to; they just struggled to report it for the probability question. So we concluded that the most likely reason for the disparity between judgment and decision results was probably that the probability of superiority judgment was not the most intuitive judgment to be eliciting – if we really wanted to elicit the beliefs directly corresponding to the incentivized decision task, we should have asked them for the difference in the probability of scoring enough points to win the award with and without the new player. But this is still just speculation, since we still wouldn’t be able to say in such a setup how much the results were impacted by only one of the responses being incentivized. 
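
To make the distinction concrete, here is a small simulation sketch, with invented numbers rather than the actual study parameters, showing how the probability-of-superiority judgment and the quantity the incentivized decision turns on can point in different directions:

```python
import random

random.seed(0)
N = 100_000

# Invented score distributions for the team without and with the new player
without_player = [random.gauss(100, 15) for _ in range(N)]
with_player = [random.gauss(103, 15) for _ in range(N)]

threshold = 120   # points needed to win the award (invented)
award = 50        # payout for clearing the threshold (invented)
cost = 2          # cost of adding the new player (invented)

# The judgment we elicited: probability of superiority
p_superiority = sum(w > wo for w, wo in zip(with_player, without_player)) / N

# The quantity the incentivized decision actually turns on:
# the change in the probability of clearing the award threshold
p_win_without = sum(s > threshold for s in without_player) / N
p_win_with = sum(s > threshold for s in with_player) / N
expected_gain = award * (p_win_with - p_win_without) - cost

print(f"P(with > without) = {p_superiority:.2f}")              # about 0.56
print(f"P(win award): {p_win_without:.2f} -> {p_win_with:.2f}")
print(f"expected gain from adding the player = {expected_gain:.2f}")
```

In this toy setup the new player is "probably better" (probability of superiority above 0.5) while the expected gain from paying for them is roughly zero or negative, which is why eliciting the threshold-crossing probabilities would have corresponded more directly to the incentivized decision.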

Oral et al. gloss over this nuance, interpreting our results as finding “decisions less accurate than their direct-equivalent judgments,” and then using this as motivation to argue that “the fact that the best visualization for judgment did not necessarily lead to better decisions reveals the need to decouple these two tasks.” 

Let’s consider for a moment by what means we could try to eliminate ambiguity in comparing probability judgments to the associated decisions. For instance, if only incentivizing one of the two responses confounds things, we might try incentivizing the probability judgment with its own payoff function, and compare the results to the incentivized decision results. Would this allow us to directly study the difference between judgments and decision-making? 

I argue no. For one, we would need to use different scoring rules for the two different types of response, and things might rank differently depending on the rule (not to mention one rule might be easier to optimize under). But on top of this, I would argue that once you provide a scoring rule for the judgment question, it becomes hard to distinguish that response from a decision by any reasonable definition. In other words, you can’t eliminate confounds that could explain a difference between “judgment” and “decision” without turning the judgment into something indistinguishable from a decision. 

What is a decision? 

The paper by Oral et al. describes abundant confusion in the literature about the difference between judgment and decision-making, proposing that “One barrier to studying decision making effectively is that judgments and decisions are terms not well-defined and separated.” They criticize various studies on visualizations for claiming to study decisions when they actually study judgments. Ultimately they describe their view as:

In summary, while decision making shares similarities with judgment, it embodies four distinguishing features: (I) it requires a choice among alternatives, implying a loss of the remaining alternatives, (II) it is future-oriented, (III) it is accompanied with overt or covert actions, and (IV) it carries a personal stake and responsibility for outcomes. The more of these features a judgment has, the more “decision-like” it becomes. When a judgment has all four features, it no longer remains a judgment and becomes a decision. This operationalization offers a fuzzy demarcation between judgment and decision making, in the sense that it does not draw a sharp line between the two concepts, but instead specifies the attributes essential to determine the extent to which a cognitive process is a judgment, a decision, or somewhere in-between [58], [59].

This captures components of other definitions of decision I’ve seen in research related to evaluating interfaces, e.g., a decision as “a choice between alternatives,” typically involving “high stakes.” However, like these other definitions, I don’t think Oral et al.’s definition very clearly differentiates a decision from other forms of judgment.

Take the “personal stake and responsibility for outcomes” part. How do we interpret this given that we are talking about subjects in an experiment, not decisions people are making in some more naturalistic context?    

In the context of an experiment, we control the stakes and one’s responsibility for their action via a scoring rule. We could instead ask people to imagine making some life or death decision in our study and call it high stakes, as many researchers do. But they are in an experiment, and they know it. In the real world people have goals, but in an experiment you have to endow them with goals.

So we should incentivize the question to ensure participants have some sense of the consequences associated with what they decide. We can ask them to separately report their beliefs, e.g., what they perceive some decision-relevant probability to be, as we did in the 2020 study. But if we want to eliminate confounds between the decision and the judgment, we should incentivize the belief question too, ideally with a proper scoring rule so that it’s in their best interest to tell us the truth. Now both our decision task and our judgment task would, from the standpoint of the experimental subject, seem to have some personal stake. So we can’t easily distinguish the decision based on its personal stakes.
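
As a minimal sketch of the proper-scoring-rule point (my own example, not anything from the paper): under a quadratic, Brier-style rule, the expected penalty is minimized by reporting your actual belief, so honest reporting is the payoff-maximizing strategy.

```python
# Expected Brier penalty for reporting r when your true belief is p.
# The quadratic rule is proper: the expectation is minimized at r = p,
# so a subject paid based on this score does best by reporting honestly.

def expected_brier_loss(report, belief):
    # The outcome is 1 with probability `belief`, 0 otherwise
    return belief * (report - 1) ** 2 + (1 - belief) * report ** 2

belief = 0.7
for report in [0.5, 0.6, 0.7, 0.8, 0.9]:
    print(report, round(expected_brier_loss(report, belief), 3))
# The loss is smallest at report = 0.7, the true belief.
```

But, as argued above, once the belief report is scored this way it starts to look like a decision in its own right: a choice of which number to report, with a personal stake in the outcome.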

Oral et al. might argue that the judgment question is still not a decision, because there are three other criteria for a decision according to their definition. Considering (I), will asking for a person’s belief require them to make a choice between alternatives? Yes, it will, because any format we elicit their response in will naturally constrain it. Even if we just provide a text box to type in a number between 0 and 1, we’re going to get values rounded at some decimal place. So it’s hard to use “a choice among alternatives” as a distinguishing criterion in any actual experiment.

What about (II), being future-oriented? Well, if I’m incentivizing the question then it will be just as future-oriented as my decision is, in that my payoff depends on my response and the ground truth, which is unknown to me until after I respond.

When it comes to (III), overt or covert actions, as in (I), in any actual experiment, my action space will be some form of constrained response space. It’s just that now my action is my choice of which beliefs to report. The action space might be larger, but again there is no qualitative difference between choosing what beliefs to report and choosing what action to report in some more constrained decision problem.

To summarize, in trying to put judgments and decisions on equal footing by scoring both, I’ve created something that seems to meet Oral et al.’s definition of a decision. While I do think there is a difference between a belief and a decision, I don’t think it’s so easy to measure these things without leaving open various other explanations for why the responses differ.

In their paper, Oral et al. sidestep incentivizing participants directly, assuming they will be intrinsically motivated. They report on two experiments where they used a task inspired by our 2020 paper (showing visualizations of expected score distributions and asking, Do you want the team with or without the new player, where the participant’s goal is to win a monetary award that requires scoring a certain number of points). Instead of incentivizing the decision with the scoring rule, they told participants to try to be accurate. And instead of eliciting the corresponding probabilistic beliefs for the decision, they asked them two questions: Which option (team) is better?, and Which of the teams do you choose? They interpret the first answer as the judgment and the second as the decision.

I can sort of see what they are trying to do here, but this seems like essentially the same task to me. Especially if you assume people are intrinsically motivated to be accurate and plan to evaluate responses using the same scoring rule, as they do. Why would we expect a difference between these two responses? To use a different example that came up in a discussion I was having with Jason Hartline, if you imagine a judge who cares only about doing the right thing (convicting the guilty and acquitting the innocent), who must decide whether to acquit or convict a defendant, why would you expect a difference (in accuracy) when you ask them ‘Is he guilty’ versus ‘Will you acquit or convict?’ 

In their first experiment using this simple wording, Oral et al. find no difference between responses to the two questions. In a second experiment they slightly changed the wording of the questions to emphasize that one was “your judgment” and one was “your decision.” This leads to what they say is suggestive evidence that people’s decisions are more accurate than their judgments. I’m not so sure.

The takeaway

It’s natural to conceive of judgments or beliefs as being distinct from decisions. If we subscribe to a Bayesian formulation of learning from data, we expect the rational person to form beliefs about the state of the world and then choose the utility maximizing action given those beliefs. However, it is not so natural to try to directly compare judgments and decisions on equal footing in an experiment. 

More generally, when it comes to evaluating human decision-making (what we generally want to do in research related to interfaces) we gain little by preferring colloquial verbal definitions over the formalisms of statistical decision theory, which provide tools designed to evaluate people’s decisions ex-ante. It’s much easier to talk about judgment and decision-making when we have a formal way of representing a decision problem (i.e., state space, action space, data-generating model, scoring rule), and a shared understanding of what the normative process of learning from data to make a decision is (i.e., start with prior beliefs, update them given some signal, choose the action that maximizes your expected score under the data-generating model). In this case, we could get some insight into how judgments and decisions can differ simply by considering the process implied by expected utility theory. 
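
Here is a tiny worked version of that formalism, with invented states, signal, and payoffs, just to show where the "judgment" (the posterior) and the "decision" (the argmax) sit in the pipeline:

```python
# Toy decision problem: two states, one observed signal, a payoff table.
# Normative pipeline: prior -> posterior given the signal -> choose the
# action with the highest posterior expected payoff.

prior = {"good_player": 0.5, "bad_player": 0.5}
likelihood = {"good_player": 0.8, "bad_player": 0.3}  # P(strong chart | state), invented
payoff = {                                            # payoff[action][state], invented
    "add_player": {"good_player": 40, "bad_player": -20},
    "keep_team": {"good_player": 0, "bad_player": 0},
}

# Judgment: posterior beliefs after seeing the "strong chart" signal
evidence = sum(prior[s] * likelihood[s] for s in prior)
posterior = {s: prior[s] * likelihood[s] / evidence for s in prior}

# Decision: the expected-payoff-maximizing action under those beliefs
expected = {a: sum(posterior[s] * payoff[a][s] for s in prior) for a in payoff}
best_action = max(expected, key=expected.get)

print(posterior)               # the judgment
print(expected, best_action)   # the decision
```

Viewed this way, the interesting empirical questions are about where the losses come from: miscalibrated posteriors, a mismatch between the elicited belief and the decision-relevant one, or a failure to take the argmax, rather than whether "judgments" or "decisions" are more accurate in the abstract.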

John Mandrola’s tips for assessing medical evidence

Ben Recht writes:

I’m a fan of physician-blogger John Mandrola. He had a nice response to your blog, using it as a jumping-off point for a short tutorial on his rather conservative approach to medical evidence assessment.

John is always even-tempered and constructive, and I thought you might enjoy this piece as an “extended blog comment.” I think he does a decent job answering the question at hand, and his approach to medical evidence appraisal is one I more or less endorse.

My post in question was called, How to digest research claims? (1) vitamin D and covid; (2) fish oil and cancer, and I concluded with this bit of helplessness: “I have no idea what to think about any of these papers. The medical literature is so huge that it often seems hopeless to interpret any single article or even subliterature. I don’t know what is currently considered the best way to summarize the state of medical knowledge on any given topic.”

In his response, “Simple Rules to Understand Medical Claims,” Mandrola offers some tips:

The most important priors when it comes to medical claims are simple: most things don’t work. Most simple answers are wrong. Humans are complex. Diseases are complex. Single causes of complex diseases like cancer should be approached with great skepticism.

One of the studies sent to Gelman was a small trial finding that Vitamin D effectively treated COVID-19. The single-center open-label study enrolled 76 patients in early 2020. Even if this were the only study available, the evidence is not strong enough to move our prior beliefs that most simple things (like a Vitamin D tablet) do not work.

The next step is a simple search—which reveals two large randomized controlled trials of Vitamin D treatment for COVID-19, one published in JAMA and the other in the BMJ. Both were null.

You can use the same strategy for evaluating the claim that fish oil supplementation leads to higher rates of prostate cancer.

Start with prior beliefs. How is it possible that one exposure increases the rate of a disease that mostly affects older men? Answer: it’s not very possible. . . .

Now consider the claims linked in Gelman’s email.

– Serum Phospholipid Fatty Acids and Prostate Cancer Risk: Results From the Prostate Cancer Prevention Trial

– Plasma Phospholipid Fatty Acids and Prostate Cancer Risk in the SELECT Trial

While both studies stemmed from randomized trials, neither was a primary analysis. These were association studies using data from the main trial, and therefore, we should be cautious in making causal claims.

Now go to Google. This reveals two large randomized controlled trials of fish oil vs placebo therapy.

– The ASCEND trial of n-3 fatty acids in 15k patients with diabetes found “no significant between-group differences in the incidence of fatal or nonfatal cancer either overall or at any particular body site.” And I would add no difference in all-cause death.

– The VITAL trial included cancer as a primary endpoint. More than 25k patients were randomized. The conclusions: “Supplementation with n−3 fatty acids did not result in a lower incidence of major cardiovascular events or cancer than placebo.”

Mandrola concludes:

I am not arguing that every claim is simple. My case is that the evaluation process is slightly less daunting than Professor Gelman seems to infer.

Of course, medical science can be complicated. Content expertise can be important. . . .

But that does not mean we should take the attitude: “I have no idea what to think about these papers.”

I offer five basic rules of thumb that help in understanding medical claims:

1. Hold pessimistic priors

2. Be super-cautious about causal inferences from nonrandom observational comparisons

3. Look for big randomized controlled trials—and focus on their primary analyses

4. Know that stuff that really works is usually obvious (antibiotics for bacterial infection; AEDs to convert VF)

5. Respect uncertainty. Stay humble about most “positive” claims.

This all makes sense, as long as we recognize that randomized controlled trials are themselves nonrandom observational comparisons: the people in the study won’t in general be representative of the population of interest, and there are also issues such as dropout, selection bias, and realism of treatments, which can be huge in medical trials. Experimentation is great; we just need to avoid the pitfalls of (a) idealizing studies that have randomization (the fallacy of treating a chain as being as strong as its strongest link) and (b) disparaging observational data without assessing its quality.
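
As a toy illustration of the representativeness point, with made-up numbers: if the treatment effect differs across groups and the trial enrolls a very different mix of groups than the target population, an internally valid trial estimate can still be far from the population-average effect.

```python
# Effect heterogeneity plus non-representative enrollment, invented numbers.
population_share = {"younger": 0.3, "older": 0.7}   # who the treatment is aimed at
trial_share = {"younger": 0.7, "older": 0.3}        # who actually enrolls
true_effect = {"younger": 0.10, "older": 0.02}      # risk reduction by group

trial_estimate = sum(trial_share[g] * true_effect[g] for g in true_effect)
population_effect = sum(population_share[g] * true_effect[g] for g in true_effect)

print(f"what the randomized trial estimates:     {trial_estimate:.3f}")     # 0.076
print(f"average effect in the target population: {population_effect:.3f}")  # 0.044
```

Randomization handles confounding within the sample; it does nothing about who is in the sample, who drops out, or how realistic the trial version of the treatment is.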

For our discussion here, the most relevant bit of Mandrola’s advice was this from the comment thread:

Why are people going to a Political Scientist for medical advice? That is odd.

I hope Prof Gelman’s answer was based on a recognition that he doesn’t have the context and/or the historical background to properly interpret the studies.

The answer is: Yes, I do recognize my ignorance! Here’s what I wrote in the above-linked post:

I’m not saying that the answers to these medical questions are unknowable, or even that nobody knows the answers. I can well believe there are some people who have a clear sense or what’s going on here. I’m just saying that I have no idea what to think about these papers.

Mandrola’s advice given above seems reasonable to me. But it can be hard for me to apply in that he’s assuming a background medical knowledge that I don’t have. On the other hand, when it comes to social science, I know a lot. For example, when I saw that claim that women during a certain time of the month were 20 percentage points more likely to vote for Barack Obama, it was immediately clear this was ridiculous, because public opinion just doesn’t change that much. This had nothing to do with randomized trials or observational comparisons or anything like that; it was just too noisy of a study to learn anything.
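
To put rough numbers on that kind of sanity check (all of them invented for illustration): compare the standard error of a small two-group comparison of proportions with the size of swings that public opinion actually shows.

```python
import math

# Rough plausibility check, numbers invented for illustration.
p = 0.5                      # baseline level of support
n_per_group = 100            # assumed respondents per group being compared
se_diff = math.sqrt(2 * p * (1 - p) / n_per_group)  # SE of a difference in proportions

plausible_true_swing = 0.02  # roughly how much opinion really moves
print(f"standard error of the comparison: {se_diff:.2f}")   # about 0.07
print(f"claimed effect: 0.20, plausible effect: {plausible_true_swing:.2f}")
# An estimate of 0.20 with a standard error near 0.07 is what noise plus
# selection on statistical significance produces even when the true effect is tiny.
```

The arithmetic is not the hard part; the hard part is having the background knowledge to know that a 2-point swing is plausible and a 20-point swing is not.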

In judo, before you learn the cool moves, you first have to learn how to fall. Maybe we should be training researchers the same way: first learn how things can go wrong, and only when you get that lesson down do you learn the fancy stuff.

I want to follow up on a suggestion from a few years ago:

In judo, before you learn the cool moves, you first have to learn how to fall. Maybe we should be training researchers, journalists, and public relations professionals the same way. First learn about Judith Miller and Thomas Friedman, and only when you get that lesson down do you get to learn about Woodward and Bernstein.

Martha in comments modified my idea:

Yes! But I’m not convinced that “First learn about Judith Miller and Thomas Friedman, and only when you get that lesson down do you get to learn about Woodward and Bernstein” or otherwise learning about people is the way to go. What is needed is teaching that involves lots of critiquing (especially by other students), with the teacher providing guidance (e.g., criticize the work or the action, not the person; no name calling; etc.) so students learn to give and accept criticism as a normal part of learning and working.

I responded:

Yes, learning in school involves lots of failure, getting stuck on homeworks, getting the wrong answer on tests, or (in grad school) having your advisor gently tone down some of your wild research ideas. Or, in journalism school, I assume that students get lots of practice in calling people and getting hung up on.

So, yes, students get the experience of failure over and over. But the message we send, I think, is that once you’re a professional it’s just a series of successes.

Another commenter pointed to this inspiring story from psychology researchers Brian Nosek, Jeffrey Spies, and Matt Motyl, who ran an experiment, thought they had an exciting result, but, just to be sure, they tried a replication and found no effect. This is a great example of how to work and explore as a scientist.

Background

Scientific research is all about discovery of the unexpected: to do research, you need to be open to new possibilities, to design experiments to force anomalies, and to learn from them. The sweet spot for any researcher is at Cantor’s corner.

Buuuut . . . researchers are also notorious for being stubborn. In particular, here’s a pattern we see a lot:
– Research team publishes surprising result A based on some “p less than .05” empirical results.
– This publication gets positive attention and the researchers and others in their subfield follow up with open-ended “conceptual replications”: related studies that also attain the “p less than .05” threshold.
– Given the surprising nature of result A, it’s unsurprising that other researchers are skeptical of A. The more theoretically-minded skeptics, or agnostics, demonstrate statistical reasons why these seemingly statistically-significant results can’t be trusted. The more empirically-minded skeptics, or agnostics, run preregistered replication studies, which fail to replicate the original claim.
– At this point, the original researchers do not apply the time-reversal heuristic and conclude that their original study was flawed (forking paths and all that). Instead they double down, insist their original findings are correct, and they come up with lots of little explanations for why the replications aren’t relevant to evaluating their original claims. And they typically just ignore or brush aside the statistical reasons why their original study was too noisy to ever show what they thought they were finding.

I’ve conjectured that one reason scientists often handle criticism in such scientifically-unproductive ways is . . . the peer-review process, which goes like this:

As scientists, we put a lot of effort into writing articles, typically with collaborators: we work hard on each article, try to get everything right, then we submit to a journal.

What happens next? Sometimes the article is rejected outright, but, if not, we’ll get back some review reports which can have some sharp criticisms: What about X? Have you considered Y? Could Z be biasing your results? Did you consider papers U, V, and W?

The next step is to respond to the review reports, and typically this takes the form of, We considered X, and the result remained significant. Or, We added Y to the model, and the result was in the same direction, marginally significant, so the claim still holds. Or, We adjusted for Z and everything changed . . . hmmmm . . . we then also thought about factors P, Q, and R. After including these, as well as Z, our finding still holds. And so on.

The point is: each of the remarks from the reviewers is potentially a sign that our paper is completely wrong, that everything we thought we found is just an artifact of the analysis, that maybe the effect even goes in the opposite direction! But that’s typically not how we take these remarks. Instead, almost invariably, we think of the reviewers’ comments as a set of hoops to jump through: We need to address all the criticisms in order to get the paper published. We think of the reviewers as our opponents, not our allies (except in the case of those reports that only make mild suggestions that don’t threaten our hypotheses).

When I think of the hundreds of papers I’ve published and the, I dunno, thousand or so review reports I’ve had to address in writing revisions, how often have I read a report and said, Hey, I was all wrong? Not very often. Never, maybe?

Where we’re at now

As scientists, we see serious criticism on a regular basis, and we’re trained to deal with it in a certain way: to respond while making minimal, ideally zero, changes to our scientific claims.

That’s what we do for a living; that’s what we’re trained to do. We think of every critical review report as a pain in the ass that we have to deal with, not as a potential sign that we screwed up.

So, given that training, it’s perhaps little surprise that when our work is scrutinized in post-publication review, we have the same attitude: the expectation that the critic is nitpicking, that we don’t have to change our fundamental claims at all, that if necessary we can do a few supplemental analyses and demonstrate the robustness of our findings to those carping critics.

How to get to a better place?

How can this situation be improved? I’m not sure. In some ways, things are getting better: the replication crisis has happened, and students and practitioners are generally aware that high-profile, well-accepted findings often do not replicate. In other ways, though, I fear we’re headed in the wrong direction: students are now expected to publish peer-reviewed papers throughout grad school, so right away they’re getting on the minimal-responses-to-criticism treadmill.

It’s not clear to me how to best teach people how to fall before they learn fancy judo moves in science.

Statistical Practice as Scientific Exploration (my talk on 4 Mar 2024 at the Royal Society conference on the promises and pitfalls of preregistration)

Here’s the conference announcement:

Discussion meeting organised by Dr Tom Hardwicke, Professor Marcus Munafò, Dr Sophia Crüwell, Professor Dorothy Bishop FRS FMedSci, Professor Eric-Jan Wagenmakers.

Serious concerns about research quality have provoked debate across scientific disciplines about the merits of preregistration — publicly declaring study plans before collecting or analysing data. This meeting will initiate an interdisciplinary dialogue exploring the epistemological and pragmatic dimensions of preregistration, identifying potential limits of application, and developing a practical agenda to guide future research and optimise implementation.

And here’s the title and abstract of my talk, which is scheduled for 14h10 on Mon 4 Mar 2024:

Statistical Practice as Scientific Exploration

Much has been written on the philosophy of statistics: How can noisy data, mediated by probabilistic models, inform our understanding of the world? Researchers when using and developing statistical methods can be seen to be acting as scientists, forming, evaluating, and elaborating provisional theories about the data and processes they are modelling. This perspective has the conceptual value of pointing toward ways that statistical theory can be expanded to incorporate aspects of workflow that were formerly tacit or informal aspects of good practice, and the practical value of motivating tools for improved statistical workflow.

I won’t really be talking about preregistration, in part because I’ve already said so much on that topic here on this blog; see for example here and various links at that post. Instead I’ll be talking about the statistical workflow, which is typically presented as a set of procedures applied to data but which I think is more like a process of scientific exploration and discovery. I addressed some of these ideas in this talk from a couple years ago. But, don’t worry, I’m sure I’ll have lots of new material. Not to mention all the other speakers at the conference.

Explainable AI works, but only when we don’t need it

This is Jessica. I went to NeurIPS last week, mostly to see what it was like. While waiting for my flight home at the airport I caught a talk that did a nice job of articulating some fundamental limitations with attempts to make deep machine learning models “interpretable” or “explainable.”

It was part of an XAI workshop. My intentions in checking out the XAI workshop were not entirely pure, as it's an area I've been skeptical of for a while. Formalizing aspects of statistical communication is very much in line with my interests, but I tried and failed to get into XAI and related work on interpretability a few years ago when it was getting popular. The ML contributions have always struck me as more of an academic exercise than a real attempt at aligning human expectations with model capabilities. When human-computer interaction people started looking into it, there was a little more attention to how people actually use explanations, but the methods used there to study human reliance on explanations have not been well grounded (e.g., 'appropriate reliance' is often defined as agreeing with the AI when it's right and not agreeing when it's wrong, which can be shown to be incoherent in various ways).

The talk, by Ulrike Luxburg, which gave a sort of impossibility result for explainable AI, was refreshing. First, she distinguished two very different scenarios for explanation: cooperative ones, where the principal with a model furnishing the explanations and the user consuming them both want the best quality/most accurate explanations, versus adversarial scenarios, where the principal's best interests are not aligned with the goal of accurate explanation. For example, a company that needs to explain why it denied someone a loan has little motivation to explain the actual reason behind that prediction, because it's not in its best interest to give people fodder to then try to minimally change their features to push the prediction to a different label. Her first point was that there is little value in trying to guarantee good explanations in the adversarial case, because existing explanation techniques (e.g., for feature attribution, like SHAP or LIME) give very different explanations for the same prediction, and the same explanation technique is often highly sensitive to small differences in the function to be explained (e.g., slight changes to parameters in training). There are so many degrees of freedom in selecting among inductive biases that the principal can easily produce something faithful by some definition while hiding important information. Hence laws guaranteeing a right to explanation miss this point.

In the cooperative setting, maybe there is hope. But, turns out something like the anthropic principle of statistics operates here: we have techniques that we can show work well in the simple scenarios where we don’t really need explanations, but when we do really need them (e.g., deep neural nets over high dimensional feature spaces) anything we can guarantee is not going to be of much use.

There’s an analogy to clustering: back when unsupervised learning was very hot, everyone wanted guarantees for clustering algorithms but to make them required working in settings where the assumptions were very strong, such that the clusters would be obvious upon inspecting the data. In explainable AI, we have various feature attribution methods that describe which features led to the prediction on a particular instance. SHAP, which borrows Shapley values from game theory to allocate credit among features, is very popular. Typically SHAP provides the marginal contribution of each feature, but Shapley Interaction Values have been proposed to allow for local interaction effects between pairs of features. Luxburg presented a theoretical result from this paper which extends Shapley Interaction Values to n-Shapley Values, which explain individual predictions with interaction terms up to order n given some number of total features d. They are additive in that they always sum to the output of the function we’re trying to explain over all subsets of combinations of variables less than or equal to n. Starting from the original Shapley values (where n=1), n-Shapley Values successively add higher-order variable interactions to the explanations.
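To make the objects under discussion concrete, here's a minimal sketch of computing per-instance feature attributions with the shap package on a toy model. Everything here (the data, the model, the planted interaction) is made up for illustration; none of it comes from the talk or the paper.

```python
# Minimal sketch: per-instance feature attributions with SHAP on a toy tree model.
# The dataset and model are placeholders, not from the work being discussed.
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                                          # four made-up features
y = X[:, 0] + 2 * X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=500)  # includes an x2*x3 interaction

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)        # Shapley-value attributions for tree ensembles
shap_values = explainer.shap_values(X[:5])   # one attribution per feature, per instance

# Each row of attributions plus the expected value approximately recovers the
# model's prediction for that instance.
print(shap_values)
print(explainer.expected_value)
```

The point of the toy interaction is that standard (order-1) SHAP has to smear that pairwise effect into two marginal attributions, which is the gap the higher-order n-Shapley Values described next are meant to address.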

The theoretical result shows that n-Shapley Values recover generalized additive models (GAMs), which are GLMs where the outcome depends linearly on smoothed functions of the inputs: g(E[Y]) = β_0 + f_1(x_1) + f_2(x_2) + … + f_m(x_m). GAMs are considered inherently interpretable, but they are also underdetermined. For n-Shapley to recover a faithful representation of the function as a GAM, the order of the explanation just needs to be at least as large as the maximum order of variable interaction in the model.
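For a sense of why first-order GAMs are treated as interpretable, here's a small sketch with synthetic data, assuming the pygam package; the setup is mine and is not taken from the theoretical result being described.

```python
# Sketch: a first-order GAM is interpretable because each feature contributes
# through its own one-dimensional smooth, which can be inspected term by term.
# Synthetic data; assumes the pygam package is installed.
import numpy as np
from pygam import LinearGAM, s

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(1000, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.2, size=1000)

gam = LinearGAM(s(0) + s(1)).fit(X, y)  # g(E[y]) = b0 + f1(x1) + f2(x2), no interactions

# Each fitted f_j can be read off one at a time.
for j in range(2):
    grid = gam.generate_X_grid(term=j)
    fj = gam.partial_dependence(term=j, X=grid)
    print(f"term {j}: partial dependence ranges from {fj.min():.2f} to {fj.max():.2f}")
```

That term-by-term reading is exactly the property that gets lost once higher-order interaction terms enter, which is where the next paragraph picks up.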

However, GAMs lose their interpretability as we add interactions. When we have large numbers of features, as is typically the case in deep learning, what is the value of the explanation? We would need to look at interactions among all combinatorial subsets of the features. So when simple explanations like standard SHAP are applied to complex functions, you're getting an average over billions of interaction terms, and there's no reduction to be made that would give you something meaningful. The fact that in the simple setting of a GAM of order 1 we can prove SHAP does the right thing does not mean we're anywhere close to having "solved" explainability.

The organizers of the workshop obviously invited this rather negative talk on XAI, so perhaps the community is undergoing self-reflection that will temper the overconfidence I associate with it. Although, the day before the workshop I also heard someone complaining that his paper on calibration got rejected from the same workshop, with an accompanying explanation that it wasn’t about LIME or SHAP. Something tells me XAI will live on.

I guess one could argue there’s still value in taking a pragmatic view, where if we find that explanations of model predictions, regardless of how meaningless, lead to better human decisions in scenarios where humans must make the final decision regardless of the model accuracy (e.g., medical diagnoses, loan decisions, child welfare cases), then there’s still some value in XAI. But who would want to dock their research program on such shaky footing? And of course we still need an adequate way of measuring reliance, but I will save my thoughts on that for another post.

 Another thing that struck me about the talk was a kind of tension around just trusting one’s instincts that something is half-baked versus taking the time to get to the bottom of it. Luxburg started by talking about how her strong gut feeling as a theorist was that trying to guarantee AI explainability was not going to be possible. I believed her before she ever got into the demonstration, because it matched my intuition. But then she spent the next 30 minutes discussing an XAI paper. There’s a decision to be made sometimes, about whether to just trust your intuition and move on to something that you might still believe in versus to stop and articulate the critique. Others might benefit from the latter, but then you realize you just spent another year trying to point out issues with a line of work you stopped believing in a long time ago. Anyway, I can relate to that. (Not that I’m complaining about the paper she presented – I’m glad she took the time to figure it out as it provides a nice example). 

I was also reminded of the kind of awkward moment that happens sometimes where someone says something rather final and damning, and everyone pauses for a moment to listen to it. Then the chatter starts right back up again like it was never said. Gotta love academics!

More than 10k scientific papers were retracted in 2023

Hi all, here to talk about one of my favorite scientific topics: integrity and correction of science.

Here is some good news for most of us and for humanity: more than 10k scientific papers have been retracted this year. Aside from the researchers who have received these notices of retraction (some of them for multiple papers), and the publishers, this is quite good news, I would argue. It comes after a big year for this topic and for the detection of fraudulent practices (see, for instance, how Guillaume Cabanac easily found papers generated by ChatGPT) and of very problematic journals, with Hindawi journals probably being more problematic than others. Many retractions and reports have focused on duplicated images or the use of tortured phrases. New fraudulent practices have also emerged and been found (see for instance our findings about "sneaked references," where some editors/journals manipulated the metadata of accepted papers to increase citations of specific scholars and journals).

Of course, some like me may always see the glass half empty, and I would still argue that probably many more papers should have been retracted and that, as I have lamented many times, the process of correcting the scientific literature is too slow, too opaque, and too bureaucratic, while at the same time not protecting, funding, or rewarding the hard-working sleuths behind the work. Most of the sleuthing takes place in spite of, rather than thanks to, the present publication and editorial system. Often the data or metadata that would facilitate investigations are not published or available (e.g., lack of metadata about ethics or about reviewing practices).

Still, I guess it is a kind of victory that sleuthing work is taken seriously these days, and I would like to take the opportunity of this milestone of 10k retracted papers to invite some of you to also participate in PubPeer discussions. I am sure your input would be quite helpful there.

Happy to read thoughts and comments on the milestone and its importance. I will continue to write (a bit more regularly I hope) here on this topic.

Lonni Besançon

“Has the replication crisis been a long series of conversations that haven’t influenced publishing and research practices much if at all?”

Kelsey Piper writes:

I’m writing about the replication crisis for Vox and I was wondering if you saw this blog post from one of the DARPA replication project participants, particularly the section that argues:

I frequently encounter the notion that after the replication crisis hit there was some sort of great improvement in the social sciences, that people wouldn’t even dream of publishing studies based on 23 undergraduates any more (I actually saw plenty of those), etc. Stuart Ritchie’s new book praises psychologists for developing “systematic ways to address” the flaws in their discipline. In reality there has been no discernible improvement.

Your blog post yesterday about scientists who don’t care about doing science struck a similar tone, and I was curious: do you think we’re in a better place w/r/t the replication crisis than we were ten years ago? Or has the replication crisis been a long series of conversations that haven’t influenced publishing and research practices much if at all?

My discussion of that above-quoted blog post appeared a couple years ago. I agreed with some of that post and disagreed with other parts.

Regarding Piper’s question, “has the replication crisis been a long series of conversations that haven’t influenced publishing and research practices much if at all?,” I don’t think the influence has been zero! For one thing, this crisis has influenced my own research practices, and I assume it’s influenced many others as well. And it’s my general impression that journals such as Psychological Science and PNAS don’t publish as much junk as they used to. I haven’t done any formal study of this, though.

P.S. For some other relevant recent discussions, see More on possibly rigor-enhancing practices in quantitative psychology research and (back to basics:) How is statistics relevant to scientific discovery?.

Exploring pre-registration for predictive modeling

This is Jessica. Jake Hofman, Angelos Chatzimparmpas, Amit Sharma, Duncan Watts, and I write:

Amid rising concerns of reproducibility and generalizability in predictive modeling, we explore the possibility and potential benefits of introducing pre-registration to the field. Despite notable advancements in predictive modeling, spanning core machine learning tasks to various scientific applications, challenges such as data-dependent decision-making and unintentional re-use of test data have raised questions about the integrity of results. To help address these issues, we propose adapting pre-registration practices from explanatory modeling to predictive modeling. We discuss current best practices in predictive modeling and their limitations, introduce a lightweight pre-registration template, and present a qualitative study with machine learning researchers to gain insight into the effectiveness of pre-registration in preventing biased estimates and promoting more reliable research outcomes. We conclude by exploring the scope of problems that pre-registration can address in predictive modeling and acknowledging its limitations within this context.

Pre-registration is no silver bullet to good science, as we discuss in the paper and later in this post. However, my coauthors and I are cautiously optimistic that adapting the practice could help address a few problems that can arise in predictive modeling pipelines like research on applied machine learning. Specifically, there are two categories of concerns where pre-specifying the learning problem and strategy may lead to more reliable estimates. 

First, most applications of machine learning are evaluated using predictive performance. Usually we evaluate this on held-out test data, because it’s too costly to obtain a continuous stream of new data for training, validation and testing. The separation is crucial: performance on held-out test data is arguably the key criterion in ML, so making reliable estimates of it is critical to avoid a misleading research literature. If we mess up and access the test data during training (test set leakage), then the results we report are overfit. It’s surprisingly easy to do this (see e.g., this taxonomy of types of leakage that occur in practice). While pre-registration cannot guarantee that we won’t still do this anyway, having to determine details like how exactly features and test data will be constructed a priori could presumably help authors catch some mistakes they might otherwise make.
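As one concrete example of how easy leakage is, here's a minimal sketch (synthetic data, scikit-learn) of the classic pattern of fitting preprocessing on the full dataset before splitting, versus keeping every data-dependent step inside a pipeline that only ever sees training data:

```python
# Sketch of a common test-set-leakage pattern: fitting preprocessing on all data
# (including future test rows) before splitting, versus keeping preprocessing
# inside a pipeline fit on training data only. Synthetic data, illustration only.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = (X[:, 0] + rng.normal(scale=1.0, size=300) > 0).astype(int)

# Leaky version: the scaler sees the test rows before the split.
X_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)
leaky = LogisticRegression().fit(X_tr, y_tr)

# Cleaner version: preprocessing lives inside a pipeline fit on training data only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clean = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_tr, y_tr)

print("leaky test accuracy:", leaky.score(X_te, y_te))
print("clean test accuracy:", clean.score(X_te, y_te))
```

With a plain scaler the numerical difference is usually small; estimates get badly inflated when the leaked step depends on the outcome (target-aware feature selection, imputation using the label, and so on), but the structural fix is the same: keep every data-dependent step inside the training fold.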

Beyond test set leakage, other types of data-dependent decisions threaten the validity of test performance estimates. Predictive modeling problems admit many degrees-of-freedom that authors can (often unintentionally) exploit in the interest of pushing the results in favor of some a priori hypothesis, similar to the garden of forking paths in social science modeling. For example, researchers may spend more time tuning their proposed methods than baselines they compare to, making it look like their new method is superior when it is not. They might report on straw man baselines after comparing test accuracy across multiple variations. They might only report the performance metrics that make test performance look best. Etc. Our sense is that most of the time this is happening implicitly: people end up trying harder for the things they are invested in. Fraud is not the central issue, so giving people tools to help them avoid unintentionally overfitting is worth exploring.

Whenever the research goal is to provide evidence on the predictability of some phenomenon (Can we predict depression from social media? Can we predict civil war onset? etc.), there's a risk that we exploit some freedoms in translating the high-level research goal to a specific predictive modeling exercise. To take an example my co-authors have previously discussed, when predicting how many re-posts a social media post will get based on properties of the person who originally posted, even with the dataset and model specification held fixed, exercising just a few degrees of freedom can change the qualitative nature of the results. If you treat it as a classification problem and build a model to predict whether a post will receive at least 10 re-posts, you can get accuracy close to 100%. If you treat it as a regression problem and predict how many re-posts a given post gets, without any data filtering, R^2 hovers around 35%. The problem is that only a small fraction of posts exceed the threshold of 10 re-posts, and predicting which posts do—and how far they spread—is very hard. Even when the drift in goal happens prior to test set access, the results can paint an overly optimistic picture. Again, pre-registration offers no guarantee of greater construct validity, but it's a way of encouraging authors to remain aware of such drift.
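Here's a toy, entirely synthetic version of that degree of freedom; the numbers have nothing to do with the actual re-post data, and the sketch is only meant to show how reframing the task changes the headline metric:

```python
# Toy illustration of how framing the same prediction problem as classification
# vs. regression changes the headline number. Entirely synthetic data; only the
# threshold of 10 "re-posts" mirrors the example in the text.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
features = rng.normal(size=(n, 5))
# Heavy-tailed outcome: most posts get few re-posts, a handful get many.
reposts = rng.poisson(lam=np.exp(0.3 * features[:, 0] + rng.normal(scale=1.5, size=n)))

X_tr, X_te, y_tr, y_te = train_test_split(features, reposts, random_state=0)

# Framing 1: classification, "will this post get at least 10 re-posts?"
clf = GradientBoostingClassifier().fit(X_tr, y_tr >= 10)
print("classification accuracy:", clf.score(X_te, y_te >= 10))  # typically high, largely from class imbalance

# Framing 2: regression, "how many re-posts will it get?"
reg = GradientBoostingRegressor().fit(X_tr, y_tr)
print("regression R^2:", reg.score(X_te, y_te))                 # typically much less impressive
```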

The specific proposal

One challenge we run into in applying pre-registration to predictive modeling is that because we usually aren’t aiming for explanation, we’re willing to throw lots of features into our model, even if we’re not sure how they could meaningfully contribute, and we’re agnostic to what sort of model we use so long as its inductive bias seems to work for our scenario. Deciding the model class ahead of time as we do in pre-registering explanatory models can be needlessly restrictive. So, the protocol we propose has two parts. 

First, prior to training, one answers the following questions, which are designed to be addressable before looking at any correlations between features and outcomes:

Phase 1 of the protocol: learning problem, variables, dataset creation, transformations, metrics, baselines

Then, after training and validation but before accessing test data, one answers the remaining questions:

Phase 2 of the protocol: prediction method, training details, test data access, anything else

Authors who want to try it can grab the forms by forking this dedicated github repo and include them in their own repository.
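To give a flavor of the kind of pre-specification involved, here's a hypothetical set of answers written out as plain Python dicts. This is my own illustrative sketch for a task like the depression-prediction problem described below, not the actual template in the repo:

```python
# Hypothetical sketch of phase-1 and phase-2 answers for a depression-prediction
# task; illustrative only, NOT the actual template from the repository.
phase_1 = {
    "learning_problem": "binary classification: does a respondent report depressive symptoms?",
    "outcome_variable": "self-reported depression indicator, dichotomized as in prior work",
    "features": "survey responses on media use and demographics; no free-text fields",
    "dataset_creation": "80/10/10 train/validation/test split, stratified by outcome",
    "transformations": "drop rows with missing outcomes; standardize numeric features",
    "metrics": "AUC (primary), accuracy at a 0.5 threshold (secondary)",
    "baselines": "majority class; logistic regression on demographics only",
}

# Phase 2 is filled in after training and validation but before touching the test set.
phase_2 = {
    "prediction_method": "gradient boosted trees, hyperparameters chosen on validation data",
    "training_details": "5-fold cross-validation on the training split for tuning",
    "test_access": "a single evaluation of the final model on the held-out test set",
    "anything_else": "none",
}
```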

What we’ve learned so far

To get a sense of whether researchers could benefit from this protocol, we observed as six ML Ph.D. students applied it to a prediction problem we provided (predicting depression in teens using responses to the 2016 Monitoring the Future survey of 12th graders, subsampled from data used by Orben and Przybylski). This helped us see where they struggled to pre-specify decisions in phase 1, presumably because doing so was out of line with their usual process of figuring some things out as they conducted model training and validation. We had to remind several to be specific about metrics and data transformations in particular. 

We also asked them in an exit interview what else they might have tried if their test performance had been lower than they expected. Half of the six participants described procedures that, if not fully reported, seemed likely to compromise the validity of their test estimates (things like going back to re-tune hyperparameters and then trying again on test data). This suggests that there's an opportunity for pre-registration, if widely adopted, to play a role in reinforcing good workflow. This may be especially useful in fields where ML models are being applied but expertise in predictive modeling is still sparse.

The caveats 

It was reassuring to directly observe examples where this protocol, if followed, might have prevented overfitting. However, the fact that we saw these issues despite having explained and motivated pre-registration during these sessions, and walked the participants through it, suggests that pre-specifying certain components of a learning pipeline alone is not necessarily enough to prevent overfitting. 

It was also notable that while all of the participants but one saw value in pre-registering, their specific understandings of why and how it could work varied. There was as much variety in their understandings of pre-registration as there was in the ways they approached the same learning problem. Pre-registration is not going to be the same thing to everyone, nor even used the same way, because the ways it helps are multi-faceted. As a result, it's dangerous to interpret the mere act of pre-registration as a stamp of good science.

I have some major misgivings about putting too much faith into the idea that publicly pre-registering guarantees that estimates are valid, and hope that this protocol gets used responsibly, as something authors choose to do because they feel it helps them prevent unintentional overfitting rather than the simple solution that guarantees to the world that your estimates are gold. It was nice to observe that a couple of study participants seemed particularly drawn to the idea of pre-registering based on perceived “intrinsic” value, remarking about the value they saw in it as a personally-imposed set of constraints to incorporate in their typical workflow.

It won’t work for all research projects. One participant figured out while talking aloud that prior work he’d done identifying certain behaviors in transformer models would have been hard to pre-register because it was exploratory in nature.

Another participant fixated on how the protocol was still vulnerable: people could lie about not having already experimented with training and validation, there’s no guarantee that the train/test split authors describe is what they actually used to produce their estimates, etc. Computer scientists tend to be good at imagining loopholes that adversarial attacks could exploit, so maybe they will be less likely to oversell pre-registration as guaranteeing validity. At the end of the day, it’s still an honor system. 

As we've written before, part of the issue with many claims in ML-based research is that performance estimates for some new approach often represent something closer to best-case performance, due to overlooked degrees of freedom, but they get interpreted as expected performance. Pre-registration is an attempt at ensuring that the estimates that get reported are more likely to represent what they're meant to represent. Maybe it's better, though, to try to change readers' perception that such estimates can be taken at face value in the first place. I'm not sure.

We’re open to feedback on the specific protocol we provide and curious to hear how it works out for those who try it. 

P.S. Against my better judgment, I decided to go to NeurIPS this year. If you want to chat pre-registration or threats to the validity of ML performance estimates find me there Wed through Sat.

Modest pre-registration

This is Jessica. In light of the hassles that can arise when authors make clear that they value pre-registration by writing papers about its effectiveness but then they can’t find their pre-registration, I have been re-considering how I feel about the value of the public aspects of pre-registration. 

I personally find pre-registration useful, especially when working with graduate students (as I am almost always doing). It gets us to agree on what we are actually hoping to see and how we are going to define the key quantities we compare. I trust my Ph.D. students, but when we pre-register we are more likely to find the gaps between our goals and the analyses that we can actually do because we have it all in a single document that we know cannot be further revised after we start collecting data.

Shravan Vasishth put it well in a comment on a previous post:

My lab has been doing pre-registrations for several years now, and most of the time what I learned from the pre-registration was that we didn’t really adequately think about what we would do once we have the data. My lab and I are getting better at this now, but it took many attempts to do a pre-registration that actually made sense once the data were in. That said, it’s still better to do a pre-registration than not, if only for the experimenter’s own sake (as a sanity pre-check). 

The part I find icky is that as soon as pre-registration gets discussed outside the lab, it often gets applied and interpreted as a symbol that the research is rigorous. Like the authors who pre-register must be doing “real science.” But there’s nothing about pre-registration to stop sloppy thinking, whether that means inappropriate causal inference, underspecification of the target population, overfitting to the specific experimental conditions, etc.

The Protzko et al. example could be taken as unusual, in that we might not expect the average reviewer to feel the need to double check the pre-registration when they see that author list includes Nosek and Nelson. On the other hand, we could see it as particularly damning evidence of how pre-registration can fail in practice, when some of the researchers that we associate with the highest standards of methodological rigor are themselves not appearing to take claims made about what practices were followed so seriously as to make sure they can back them up when asked. 

My skepticism about how seriously we should take public declarations of pre-registration is influenced by my experience as author and reviewer, where, at least in the venues I’ve published in, when you describe your work as pre-registered it wins points with reviewers, increasing the chances that someone will comment about the methodological rigor, that your paper will win an award, etc. However, I highly doubt the modal reviewer or reader is checking the preregistration. At least, no reviewer has ever asked a single question about the pre-registration in any of the studies I’ve ever submitted, and I’ve been using pre-registration for at least 5 or 6 years. I guess it’s possible they are checking it and it’s just all so perfectly laid out in our documents and followed to a T that there’s nothing to question. But I doubt that… surely at some point we’ve forgotten to fully report a pre-specified exploratory analysis, or the pre-registration wasn’t clear, or something else like that. Not a single question ever seems fishy.

Something I dislike about authors' incentives when reporting on their methods in general is that reviewers (and readers) can often be unimaginative. So what the authors say about their work can set the tone for how the paper is received. I hate when authors describe their own work in a paper as "rigorous" or "highly ecologically valid" or "first to show" rather than just allowing the details to speak for themselves. It feels like cheap marketing. But I can understand why some do it, because one really can impress some readers by saying such things. Hence, points won for mentioning pre-registration, with no real checks and balances, can be a real issue.

How should we use pre-registration in light of all this? If nobody cares to do the checking, but extra credit is being handed out when authors slap the “pre-registered” label on their work, maybe we want to pre-register more quietly.

At the extreme, we could pre-register amongst ourselves, in our labs or whatever, without telling everyone about it. Notify our collaborators by email or slack or whatever else when we’ve pinned down the analysis plan and are ready to collect the data but not expect anyone else to care, except maybe when they notice that our research is well-engineered in general, because we are the kind of authors who do our best to keep ourselves honest and use transparent methods and subject our data to sensitivity analyses etc. anyways.

I’ve implied before on the blog that pre-registration is something I find personally useful but see externally as a gesture toward transparency more than anything else. If we can’t trust authors when they claim to pre-register, but we don’t expect the reviewing or reading standards in our communities to evolve to the point where checking to see what it actually says becomes mainstream, then we could just omit the signaling aspect altogether and continue to trust that people are doing their best. I’m not convinced we would lose much in such a world as pre-registration is currently practiced in the areas I work in. Maybe the only real way to fix science is to expect people to find reasons to be self-motivated to do good work. And if they don’t, well, it’s probably going to be obvious in other ways than just a lack of pre-registration. Bad reasoning should be obvious and if it’s not, maybe we should spend more time training students on how to recognize it.

But of course this seems unrealistic, since you can't stop people from saying things in papers that they think reviewers will find relevant. And many reviewers have already shown they find it relevant to hear about a pre-registration. Plus, of course, the only real benefit we can say with certainty that pre-registration provides is that if one pre-registers, others can verify to what extent the analysis was planned beforehand and therefore less subject to authors exploiting degrees of freedom, so we'd lose this.

An alternative strategy is to be more specific about pre-registration while crowing about it less. Include the pre-registration link in your manuscript but stop with all the label-dropping that often occurs, in the abstract, the introduction, sometimes in the title itself describing how this study is pre-registered. (I have to admit, I have been guilty of this, but from now on I intend to remove such statements from papers I’m on).

Pre-registration statements should be more specific, in light of the fact that we can’t expect reviewers to catch deviations themselves. E.g., if you follow your pre-registration to a T, say something like “For each of our experiments, we report all sample sizes, conditions, data exclusions, and measures for the main analyses that were described in our pre-registration documents. We do not report any analyses that were not included in our pre-registration.” That makes it clear what you are knowingly claiming regarding the pre-registration status of your work. 

Of course, some people may say reasonably specific things even when they can’t back them up with a pre-registration document. But being specific at least acknowledges that a pre-registration is actually a bundle of details that we must mind if we’re going to claim to have done it, because they should impact how it’s assessed. Plus maybe the act of typing out specific propositions would remind some authors to check what their pre-registration actually says. 

If you don’t follow your pre-registration to a T, which I’m guessing is more common in practice, then there are a few strategies I could see using:

Put in a dedicated paragraph before you describe results detailing all deviations from what you pre-registered. If it’s a whole lot of stuff, perhaps the act of writing this paragraph will convince you to just skip reporting on the pre-registration altogether because it clearly didn’t work out. 

Label each individual comparison/test as pre-registered versus not as you walk through the results. Personally I think this makes things harder to keep track of than a single dedicated paragraph, but maybe there are occasionally situations where it's better.

(back to basics:) How is statistics relevant to scientific discovery?

Following up on today’s post, “Why I continue to support the science reform movement despite its flaws,” it seems worth linking to this post from 2019, about the way in which some mainstream academic social psychologists have moved beyond denial, to a more realistic view that accepts that failure is a routine, indeed inevitable part of science, and that, just because a claim is published, even in a prestigious journal, that doesn’t mean it has to be correct:

Once you accept that the replication rate is not 100%, nor should it be, and once you accept that published work, even papers by ourselves and our friends, can be wrong, this provides an opening for critics, the sort of scientists whom academic insiders used to refer to as “second stringers.”

Once you move to the view that “the bar for communicating to each other should not be high. We can decide for ourselves what to make of each other’s speech,” there is a clear role for accurate critics to move this process along. Just as good science is, ultimately, the discovery of truths that ultimately would be discovered by someone else sometime in the future, thus, the speeding along of a process that we’d hope would happen anyway, so good criticism should speed along the process of scientific correction.

Criticism, correction, and discovery all go together. Obviously discovery is the key part, otherwise there’d be nothing to criticize, indeed nothing to talk about. But, from the other direction, criticism and correction empower discovery. . . .

Just as, in economics, it is said that a social safety net gives people the freedom to start new ventures, in science the existence of a culture of robust criticism should give researchers a sense of freedom in speculation, in confidence that important mistakes will be caught.

Along with this is the attitude, which I strongly support, that there’s no shame in publishing speculative work that turns out to be wrong. We learn from our mistakes. . . .

Speculation is fine, and we don’t want the (healthy) movement toward replication to create any perverse incentives that would discourage people from performing speculative research. What I’d like is for researchers to be more aware of when they’re speculating, both in their published papers and in their press materials. Not claiming a replication rate of 100%, that’s a start. . . .

What, then, is—or should be—the role of statistics, and statistical criticism in the process of scientific research?

Statistics can help researchers in three ways:
– Design and data collection
– Data analysis
– Decision making.

And how does statistical criticism fit into all this? Criticism of individual studies has allowed us to develop our understanding, giving us insight into designing future studies and interpreting past work. . . .

We want to encourage scientists to play with new ideas. To this purpose, I recommend the following steps:

– Reduce the costs of failed experimentation by being more clear when research-based claims are speculative.

– React openly to follow-up studies. Once you recognize that published claims can be wrong (indeed, that’s part of the process), don’t hang on to them too long or you’ll reduce your opportunities to learn.

– Publish all your data and all your comparisons (you can do this using graphs so as to show many comparisons in a compact grid of plots). If you follow current standard practice and focus on statistically significant comparisons, you’re losing lots of opportunities to learn.

– Avoid the two-tier system. Give respect to a student project or arXiv paper just as you would to a paper published in Science or Nature.

We should all feel free to speculate in our published papers without fear of overly negative consequences in the (likely) event that our speculations are wrong; we should all be less surprised to find that published research claims did not work out (and that’s one positive thing about the replication crisis, that there’s been much more recognition of this point); and we should all be more willing to modify and even let go of ideas that didn’t happen to work out, even if these ideas were published by ourselves and our friends.

There’s more at the link, and also let me again plug my recent article, Before data analysis: Additional recommendations for designing experiments to learn about the world.

Why I continue to support the science reform movement despite its flaws

I was having a discussion with someone about problems with the science reform movement (as discussed here by Jessica), and he shared his opinion that "Scientific reform in some corners has elements of millenarian cults. In their view, science is not making progress because of individual failings (bias, fraud, qrps) and that if we follow a set of rituals (power analysis, preregistration) devised by the leaders then we can usher in a new era where the truth is revealed (high replicability)."

My quick reaction was that this reminded me of an annoying thing where people use “religion” as a term of insult. When this came up before, I wrote that maybe it’s time to retire use of the term “religion” to mean “uncritical belief in something I disagree with.”

But then I was thinking about this all from another direction, and I think there’s something there there. Not the “millenarian cults” thing, which I think was an overreaction on my correspondent’s part.

Rather, I see a paradox. From his perspective, my correspondent sees the science reform movement as having a narrow perspective, an enforced conformity that leads it into unforced errors such as publishing a high-profile paper promoting preregistration without actually itself following preregistered analysis plans. OK, he doesn't see all of the science reform movement as being so narrow—for one thing, I'm part of the science reform movement and I wasn't part of that project!—but he sees some core of the movement as being stuck in narrow rituals and leader-worship.

But I think it’s kind of the opposite. From my perspective, the core of the science reform movement (the Open Science Framework, etc.) has had to make all sorts of compromises with conservative forces in the science establishment, especially within academic psychology, in order to keep them on board. To get funding, institutional support, buy-in from key players, . . . that takes a lot of political maneuvering.

I don’t say this lightly, and I’m not using “political” as a put-down. I’m a political scientist, but personally I’m not very good at politics. Politics takes hard work, requiring lots of patience and negotiation. I’m impatient and I hate negotiation; I’d much rather just put all my cards face-up on the table. For some activities, such as blogging and collaborative science, these traits are helpful. I can’t collaborate with everybody, but when the connection’s there, it can really work.

But there’s more to the world than this sort of small-group work. Building and maintaining larger institutions, that’s important too.

So here’s my point: Some core problems with the open-science movement are not a product of cult-like groupthink. Rather, it’s the opposite: this core has been structured out of a compromise with some groups within psychology who are tied to old-fashioned thinking, and this politically-necessary (perhaps) compromise has led to some incoherence, in particular the attitude or hope that, by just including some preregistration here and getting rid of some questionable research practices there, everyone could pretty much continue with business as usual.

Summary

The open-science movement has always had a tension between burn-it-all-down and here’s-one-quick-trick. Put them together and it kinda sounds like a cult that can’t see outward, but I see it as more the opposite, as an awkward coalition representing fundamentally incoherent views. But both sides of the coalition need each other: the reformers need the old institutional powers to make a real difference in practice, and the oldsters need the reformers because outsiders are losing confidence in the system.

The good news

The good news for me is that both groups within this coalition should be able to appreciate frank criticism from the outside (they can listen to me scream and get something out of it, even if they don't agree with all my claims) and should also be able to appreciate research methods: once you accept the basic tenets of the science reform movement, there are clear benefits to better measurement, better design, and better analysis. In the old world of p-hacking, there was no real reason to do your studies well, as you could get statistical significance and publication with any old random numbers, along with a few framing tricks. In the new world of science reform (even imperfect science reform), this sort of noise mining isn't so effective, and traditional statistical ideas of measurement, design, and analysis become relevant again.

So that’s one reason I’m cool with the science reform movement. I think it’s in the right direction: its dot product with the ideal direction is positive. But I’m not so good at politics so I can’t resist criticizing it too. It’s all good.

Reactions

I sent the above to my correspondent, who wrote:

I don't think it is a literal cult in the sense that carries the normative judgments and pejorative connotations we usually ascribe to cults and religions. The analogy was more of a shorthand to highlight a common dynamic that emerges when you have a shared sense of crisis, ritualistic/procedural solutions, and a hope that merely performing these activities will get past the crisis and bring about a brighter future. This is a spot where group-think can, and at times possibly should, kick in. People don't have time to each individually and critically evaluate the solutions, and often the claim is that they need to be implemented broadly to work. Sometimes these dynamics reflect a real problem with real solutions, sometimes they're totally off the rails. All this is not to say I'm opposed to scientific reform; I'm very much for it in the general sense. There's no shortage of room for improvement in how we turn observations into understanding, from improving statistical literacy and theory development to transparency and fostering healthier incentives. I am, however, wary of the uncritical belief that the crisis is simply one of failed replications and that the performance of "open science rituals" is sufficient for reform, across the breadth of things we consider science. As a minor point, I don't think the vast majority of prominent figures in open science intend for these dynamics to occur, but I do think they all should be wary of them.

There does seem to be a problem that many researchers are too committed to the “estimate the effect” paradigm and don’t fully grapple with the consequences of high variability. This is particularly disturbing in psychology, given that just about all psychology experiments study interactions, not main effects. Thus, a claim that effect sizes don’t vary much is a claim that effect sizes vary a lot in the dimension being studied, but have very little variation in other dimensions. Which doesn’t make a lot of sense to me.

Getting back to the open-science movement, I want to emphasize the level of effort it takes to conduct and coordinate these big group efforts, along with the effort required to keep together the coalition of skeptics (who see preregistration as a tool for shooting down false claims) and true believers (who see preregistration as a way to defuse skepticism about their claims) and get these papers published in top journals. I'd also say it takes a lot of effort for them to get funding, but that would be kind of a cheap shot, given that I too put in a lot of effort to get funding!

Anyway, to continue, I think that some of the problems with the science reform movement are that it effectively promises different things to different people. And another problem is with these massive projects that inevitably include things that not all the authors will agree with.

So, yeah, I have a problem with simplistic science reform prescriptions, for example recommendations to increase sample size without any nod toward effect size and measurement. But much much worse, in my opinion, are the claims of success we’ve seen from researchers and advocates who are outside the science-reform movement. I’m thinking here about ridiculous statements such as the unfounded claim of 17 replications of power pose, or the endless stream of hype from the nudgelords, or the “sleep is your superpower” guy, or my personal favorite, the unfounded claim from Harvard that “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.”

It’s almost enough to stop here with the remark that the scientific reform movement has been lucky in its enemies.

But I also want to say that I appreciate that the "left wing" of the science reform movement—the researchers who envision replication and preregistration and the threat of replication and preregistration as a tool to shoot down bad studies—have indeed faced real resistance within academia and the news media to their efforts, as lots of people will hate the bearers of bad news. And I also appreciate the "right wing" of the science reform movement—the researchers who envision replication and preregistration as a way to validate their studies and refute the critics—in that they're willing to put their ideas to the test. Not always perfectly, but you have to start somewhere.

While I remain annoyed at certain aspects of the mainstream science reform movement, especially when it manifests itself in mass-authored articles such as the notorious recent non-preregistered paper on the effects of preregistration, or that “Redefine statistical significance” article, or various p-value hardliners we’ve encountered over the decades, I also respect the political challenges of coalition-building that are evident in that movement.

So my plan remains to appreciate the movement while continuing to criticize its statements that seem wrong or do not make sense.

I sent the above to Jessica Hullman, who wrote:

I can relate to being surprised by the reactions of open science enthusiasts to certain lines of questioning. In my view, how to fix science is about as complicated a question as we will encounter. The certainty/level of comfortableness with making bold claims that many advocates of open science seem to have is hard for me to understand. Maybe that is just the way the world works, or at least the way it works if you want to get your ideas published in venues like PNAS or Nature. But the sensitivity to what gets said in public venues against certain open science practices or people reminds me very much of established academics trying to hush talk about problems in psychology, as though questioning certain things is off limits. I've been surprised on the blog for example when I think aloud about something like preregistration being imperfect and some commenters seem to have a visceral negative reaction to see something like that written. To me that's the opposite of how we should be thinking.

As an aside, someone I’m collaborating with recently described to me his understanding of the strategy for getting published in PNAS. It was 1. Say something timely/interesting, 2. Don’t be wrong. He explained that ‘Don’t be wrong’ could be accomplished by preregistering and large sample size. Naturally I was surprised to hear #2 described as if it’s really that easy. Silly me for spending all this time thinking so hard about other aspects of methods!

The idea of necessary politics is interesting; not what I would have thought of but probably some truth to it. For me many of the challenges of trying to reform science boil down to people being heuristic-needing agents. We accept that many problems arise from ritualistic behavior, but we have trouble overcoming that, perhaps because no matter how thoughtful/nuanced some may prefer to be, there’s always a larger group who want simple fixes / aren’t incentivized to go there. It’s hard to have broad appeal without being reductionist I guess.

Hey! Here’s how to rewrite and improve the title and abstract of a scientific paper:

Last week in class we read and then rewrote the title and abstract of a paper. We did it again yesterday, this time with one of my recent unpublished papers.

Here’s what I had originally:

title: Unifying design-based and model-based sampling inference by estimating a joint population distribution for weights and outcomes

abstract: A well-known rule in practical survey research is to include weights when estimating a population average but not to use weights when fitting a regression model—as long as the regression includes as predictors all the information that went into the sampling weights. But it is not clear how to apply this advice when fitting regressions that include only some of the weighting information, nor does it tell us what to do when analyzing already-collected surveys where the weighting procedure has not been clearly explained or where the weights depend in part on information that is not available in the data. It is also not clear how one is supposed to account for clustering in such analyses. We propose a quasi-Bayesian approach using a joint regression of the outcome and the sampling weight, followed by poststratification on the two variables, thus using design information within a model-based context to obtain inferences for small-area estimates, regressions, and other population quantities of interest.

Not terrible, but we can do better. Here’s the new version:

title: MRP using sampling weights

abstract: A well-known rule in practical survey research is to include weights when estimating a population average but not to use weights when fitting a regression model—as long as the regression includes as predictors all the information that went into the sampling weights. But what if you don't know where the weights come from? We propose a quasi-Bayesian approach using a joint regression of the outcome and the sampling weight, followed by poststratification on the two variables, thus using design information within a model-based context to obtain inferences for small-area estimates, regressions, and other population quantities of interest.

How did we get there?

The title. The original title was fine—it starts with some advertising (“Unifying design-based and model-based sampling inference”) and follows up with a description of how the method works (“estimating a joint population distribution for weights and outcomes”).

But the main point of the title is to get the notice of potential readers, people who might find the paper useful or interesting (or both!).

This pushes the question back one step: Who would find this paper useful or interesting? Anyone who works with sampling weights. Anyone who uses public survey data or, more generally, surveys collected by others, which typically contain sampling weights. And anyone who’d like to follow my path in survey analysis, which would be all the people out there who use MRP (multilevel regression and poststratification). Hence the new title, which is crisp, clear, and focused.

My only problem with the new title, “MRP using sampling weights,” is that it doesn’t clearly convey that the paper involves new research. It makes it look like a review article. But that’s not so horrible; people often like to learn from review articles.

The abstract. If you look carefully, you’ll see that the new abstract is the same as the original abstract, except that we replaced the middle part:

But it is not clear how to apply this advice when fitting regressions that include only some of the weighting information, nor does it tell us what to do when analyzing already-collected surveys where the weighting procedure has not been clearly explained or where the weights depend in part on information that is not available in the data. It is also not clear how one is supposed to account for clustering in such analyses.

with this:

But what if you don’t know where the weights come from?

Here’s what happened. We started by rereading the original abstract carefully. That abstract has some long sentences that are hard to follow. The first sentence is already kinda complicated, but I decided to keep it, because it clearly lays out the problem, and also I think the reader of an abstract will be willing to work a bit when reading the first sentence. Getting to the abstract at all is a kind of commitment.

The second sentence, though, that's another tangle, and at this point the reader is tempted to give up and just skate along to the end—which I don't want! The third sentence isn't horrible, but it's still a little bit long (starting with the nearly-contentless "It is also not clear how one is supposed to account for" and ending with the unnecessary "in such analyses"). Also, we don't even really talk much about clustering in the paper! So it was a no-brainer to collapse these into a sentence that was much more snappy and direct.

Finally, yeah, the final sentence of the abstract is kinda technical, but (a) the paper’s technical, and we want to convey some of its content in the abstract!, and (b) after that new, crisp, replacement second sentence, I think the reader is ready to take a breath and hear what the paper is all about.

General principles

Here’s a general template for a research paper:
1. What is the goal or general problem?
2. Why is it important?
3. What is the challenge?
4. What is the solution? What must be done to implement this solution?
5. If the idea in this paper is so great, why wasn’t the problem already solved by someone else?
6. What are the limitations of the proposed solution? What is its domain of applicability?

We used these principles in our rewriting of my title and abstract. The first step was for me to answer the above 6 questions:
1. Goal is to do survey inference with sampling weights.
2. It’s important for zillions of researchers who use existing surveys which come with weights.
3. The challenge is that if you don’t know where the weights come from, you can’t just follow the recommended approach to condition in the regression model on the information that is predictive of inclusion into the sample.
4. The solution is to condition on the weights themselves, which involves the additional step of estimating a joint population distribution for the weights and other predictors in the model.
5. The problem involves a new concept (imagining a population distribution for weights, which is not a coherent assumption, because, in the real world, weights are constructed based on the data) and some new mathematical steps (not inherently sophisticated as mathematics, but new work from a statistical perspective). Also, the idea of modeling the weights is not completely new; there is some related literature, and one of our contributions is to take weights (which are typically constructed from a non-Bayesian design-based perspective) and use them in a Bayesian analysis.
6. Survey weights do not include all design information, so the solution offered in the paper can only be approximate. In addition the method requires distributional assumptions on the weights; also it’s a new method so who knows how useful it will be in practice.

We can’t put all of that in the abstract, but we were able to include some versions of the answers to questions 1, 3, and 4. Questions 5 and 6 are important, but it’s ok to leave them to the paper, as this is where readers will typically search for limitations and connections to the literature.

Maybe we should include the answer to question 2 in the abstract, though. Perhaps we could replace “But what if you don’t know where the weights come from?” with “But what if you don’t know where the weights come from? This is often a problem when analyzing surveys collected by others.”

Summary

By thinking carefully about goals and audience, we improved the title and abstract of a scientific paper. You should be able to do this in your own work!

Of course it's preregistered. Just give me a sec

This is Jessica. I was going to post something on Bak-Coleman and Devezer's response to the Protzko et al. paper on the replicability of research that uses rigor-enhancing practices like large samples, preregistration, confirmatory tests, and methodological transparency, but Andrew beat me to it. But since his post didn't get into one of the surprising aspects of their analysis (beyond the paper making a causal claim without a study design capable of assessing causality), I'll blog on it anyway.

Bak-Coleman and Devezer describe three ways in which the measure of replicability that Protzko et al. use to argue that the 16 effects they study are more replicable than effects in prior studies deviates from prior definitions of replicability:

  1. Protzko et al. define replicability as the chance that any replication achieves significance in the hypothesized direction as opposed to whether the results of the confirmation study and the replication were consistent 
  2. They include self-replications in calculating the rate
  3. They include repeated replications of the same effect and replications across different effects in calculating the rate

Could these deviations in how replicability is defined have been decided post-hoc, so that the authors could present positive evidence for their hypothesis that rigor-enhancing practices work? If they preregistered their definition of replicability, we would not be so concerned about this possibility.  Luckily, the authors report that “All confirmatory tests, replications and analyses were preregistered both in the individual studies (Supplementary Information section 3 and Supplementary Table 2) and for this meta-project (https://osf.io/6t9vm).”

But wait – according to Bak-Coleman and Devezer:

the analysis on which the titular claim depends was not preregistered. There is no mention of examining the relationship between replicability and rigor-improving methods, nor even how replicability would be operationalized despite extensive descriptions of the calculations of other quantities. With nothing indicating this comparison or metric it rests on were planned a priori, it is hard to distinguish the core claim in this paper from selective reporting and hypothesizing after the results are known. 

Uh-oh, that’s not good. At this point, some OSF sleuthing was needed. I poked around the link above and the associated project containing the analysis code. There are a couple of analysis plans: Proposed Overarching Analyses for Decline Effect final.docx, from 2018, and Decline Effect Exploratory analyses and secondary data projects P4.docx, from 2019. However, these do not appear to describe the primary analysis of replicability in the paper (the first describes an analysis that ends up in the Appendix, and the second a bunch of exploratory analyses that don’t appear in the paper). About a year later, the analysis notebooks with the results they present in the main body of the paper were added.

According to Bak-Coleman on X/Twitter: 

We emailed the authors a week ago. They’ve been responsive but as of now, they can’t say one way or another if the analyses correspond to a preregistration. They think they may be in some documentation.

In the best-case scenario, where the missing preregistration is soon found, this example suggests that there are still many readers and reviewers for whom some signal of rigor suffices even when the evidence for it is lacking. In this case, maybe the reputation of authors like Nosek reduced the perceived need on the part of the reviewers to track down the actual preregistration. But of course, even those who invented rigor-enhancing practices can still make mistakes!

In the alternative scenario where the preregistration is not found soon, what is the correct course of action? Surely at least a correction is in order? Otherwise we might all feel compelled to try our luck at signaling preregistration without having to inconvenience ourselves by actually doing it.

More optimistically, perhaps there are exciting new research directions that could come out of this. Like, wearable preregistration, since we know from centuries of research and practice that it’s harder to lose something when it’s sewn to your person. Or, we could submit our preregistrations to OpenAI, I mean Microsoft, who could make a ChatGPT-enabled Preregistration Buddy that not only trains on your preregistration but also knows how to please a human judge who wants to ask questions about what it said.

More on possibly rigor-enhancing practices in quantitative psychology research

In a paper entitled “Causal claims about scientific rigor require rigorous causal evidence,” Joseph Bak-Coleman and Berna Devezer write:

Protzko et al. (2023) claim that “High replicability of newly discovered social-behavioral findings is achievable.” They argue that the 86% rate of replication observed in their replication studies is due to “rigor-enhancing practices” such as confirmatory tests, large sample sizes, preregistration and methodological transparency. These findings promise hope as concerns over low rates of replication have plagued the social sciences for more than a decade. Unfortunately, the observational design of the study does not support its key causal claim. Instead, inference relies on a post hoc comparison of a tenuous metric of replicability to past research that relied on incommensurable metrics and sampling frames.

The article they’re referring to is by a team of psychologists (John Protzko, Jon Krosnick, et al.) reporting “an investigation by four coordinated laboratories of the prospective replicability of 16 novel experimental findings using rigor-enhancing practices: confirmatory tests, large sample sizes, preregistration, and methodological transparency. . . .”

When I heard about that paper, I teed off on their proposed list of rigor-enhancing practices.

I’ve got no problem with large sample sizes, preregistration, and methodological transparency. And confirmatory tests can be fine too, as long as they’re not misinterpreted and not used for decision making.

My biggest concern is that the authors or readers of that article will think that these are the best rigor-enhancing practices in science (or social science, or psychology, or social psychology, etc.), or the first rigor-enhancing practices that researchers should reach for, or the most important rigor-enhancing practices, or anything like that.

Instead, I gave my top 5 rigor-enhancing practices, in approximately decreasing order of importance:

1. Make it clear what you’re actually doing. Describe manipulations, exposures, and measurements fully and clearly.

2. Increase your effect size, e.g., do a more effective treatment. (A back-of-the-envelope power sketch follows this list.)

3. Focus your study on the people and scenarios where effects are likely to be largest.

4. Improve your outcome measurement.

5. Improve pre-treatment measurements.
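
A rough way to see why boosting the effect size (items 2 and 3) or reducing measurement noise (items 4 and 5) can matter as much as raw sample size: power depends on the ratio of the effect size to the standard error, and the standard error shrinks only with the square root of n. Here is a hedged back-of-the-envelope sketch, with all numbers invented:

```python
import numpy as np
from scipy import stats

def power_two_group(effect, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two group means."""
    se = sd * np.sqrt(2 / n_per_group)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    return 1 - stats.norm.cdf(z_crit - effect / se)

# Baseline: small effect, noisy outcome measure.
print(power_two_group(effect=0.2, sd=1.0, n_per_group=100))   # ~0.29
# Quadrupling the sample size...
print(power_two_group(effect=0.2, sd=1.0, n_per_group=400))   # ~0.81
# ...buys about as much as doubling the effect (item 2)...
print(power_two_group(effect=0.4, sd=1.0, n_per_group=100))   # ~0.81
# ...or halving the outcome noise with better measurement (item 4).
print(power_two_group(effect=0.2, sd=0.5, n_per_group=100))   # ~0.81
```

None of this substitutes for the arguments in the linked post; it is just arithmetic showing why design choices that increase the effect or reduce measurement noise can do the work of a much larger sample.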

The suggestions of “confirmatory tests, large sample sizes, preregistration, and methodological transparency” are all fine, but I think they’re all less important than the 5 steps listed above. You can read the linked post to see my reasoning; there’s also Pam Davis-Kean’s summary: “Know what the hell you are doing with your research.” You might say that goes without saying, but it doesn’t, even in some papers published in top journals such as Psychological Science and PNAS!

You can also read a response to my post from Brian Nosek, a leader in the replication movement and one of the coauthors of the article being discussed.

In their new article, Bak-Coleman and Devezer take a different tack than me, in that they’re focused on the challenges of measuring the replicability of empirical claims in psychology, whereas I was more interested in the design of future studies. I find the whole replicability thing important mainly to the extent that it gives researchers and users of research less trust in generic statistics-backed claims; I’d guess that actual effects typically vary so much with context that new general findings are mostly not to be trusted. So I’d say that Protzko et al., Nosek, Bak-Coleman and Devezer, and I are coming from four different directions. (Yes, I recognize that Nosek is one of the authors of the Protzko et al. paper; still, in his blog comment he seemed to have a slightly different perspective.) The article by Bak-Coleman and Devezer seems very relevant to any attempt to understand the empirical claims of Protzko et al.