More than 10k scientific papers were retracted in 2023

Hi all, here to talk about one of my favorite scientific topics: integrity and the correction of science.

Here is some good news for most of us and for humanity: more than 10k scientific papers have been retracted this year. Aside from the researchers who received these retraction notices (some of them for multiple papers), and the publishers, this is quite good news, I would argue. It comes after a big year for this topic and for the detection of fraudulent practices (see, for instance, how Guillaume Cabanac easily found papers generated by ChatGPT) and of very problematic journals, with Hindawi journals probably being more problematic than others. Many retractions and reports have focused on duplicated images or the use of tortured phrases. New fraudulent practices have also emerged and been uncovered (see, for instance, our findings on “sneaked references,” where some editors/journals manipulated the metadata of accepted papers to inflate citations to specific scholars and journals).

Of course, some like me will always see the glass half empty, and I would still argue that many more papers should probably have been retracted and that, as I have lamented many times, the process of correcting the scientific literature is too slow, too opaque, and too bureaucratic, while at the same time not protecting, funding, or rewarding the hard-working sleuths behind the work. Most of the sleuthing takes place in spite of, rather than thanks to, the present publication and editorial system. Often the data or metadata that would facilitate investigations are not published or available (e.g., metadata about ethics approval or about reviewing practices).

Still, I suppose it is a victory of sorts that sleuthing work is taken seriously these days, and I would like to take the opportunity of this milestone of 10k retracted papers to invite some of you to also participate in PubPeer discussions. I am sure your input would be quite helpful there.

Happy to read thoughts and comments on the milestone and its importance. I will continue to write (a bit more regularly, I hope) here on this topic.

Lonni Besançon

A client tried to stiff me for $5000. I got my money, but should I do something?

This post is by Phil Price, not Andrew.

A few months ago I finished a small consulting contract — it would have been less than three weeks if I had worked on it full time — and I find it has given me some things to think about, concerning statistical modeling (no surprise there) but also ethics. There’s no particular reason anyone would be interested in hearing me ramble on about what was involved in the job itself, but I’m going to do that anyway for a few paragraphs. Maybe it will be of interest to others who are considering going into consulting. If you are here for the ethical question, you can skip the next several paragraphs; pick up the story at the line of XXXX, far below.


The NeurIPS 2020 broader impacts experiment

This year NeurIPS, a top machine learning conference, required a broader impacts statement from authors. From the call:

 In order to provide a balanced perspective, authors are required to include a statement of the potential broader impact of their work, including its ethical aspects and future societal consequences. Authors should take care to discuss both positive and negative outcomes

I heard that next year ICML, another top ML conference, will add the same requirement. 

Questions about how to make computer scientists more thoughtful about the potential societal implications of what they create have been around for a while, with an increasing number of researchers looking at how to foster reflection or transparency through different design methods, fact sheets to accompany tech contributions, etc. But a requirement that all authors try to address broader societal implications in their publications is new. Actions like these are part of a reform movement aimed at shifting computer science values rather drastically away from the conventional view that algorithms and math are outside of any moral philosophy.

Here, I’m not going to take on the bigger questions of what this kind of action means or how useful it is; instead I’ll reflect on how it was done, and on the questions that I, as a curious outsider (I don’t publish at NeurIPS), have had in looking at the official messaging about the broader impacts statement. It’s felt a bit like doing a puzzle.

While the call doesn’t go into too much detail about how the statements should be written or used, the FAQ for authors says:

Do I have to complete the Broader Impact section? Answer: Yes, please include the section in the submission for review. However, if your work is very theoretical or is general enough that there is no particular application foreseen, then you are free to write that a Broader Impact discussion is not applicable.

So, some acknowledgment of the requirement is required, at a minimum. How is it used in the reviewing process?

Can my submission be rejected solely on the basis of the Broader Impact Section? Answer: No. Reviewers will be asked to rate a submission based on the evaluation criteria. They will also be asked to check whether the Broader Impact is adequately addressed. In general, if a paper presents theoretical work without any foreseeable impact in the society, authors can simply state “This work does not present any foreseeable societal consequence”. If a paper presents a method or an application that might have reasonable chances to have some broader impact, authors can discuss along the following lines: “This work has the following potential positive impact in the society…. At the same time, this work may have some negative consequences because… Furthermore, we should be cautious of the result of failure of the system which could cause…” 

I checked out the evaluation criteria, which are also part of the call and include this sentence about broader impacts:

Regardless of scientific quality or contribution, a submission may be rejected for ethical considerations, including methods, applications, or data that create or reinforce unfair bias or that have a primary purpose of harm or injury. 

It’s a little ambiguous, but since they say above that a submission cannot be rejected solely on the basis of the Broader Impact section, I assume the reviewer would have to point to other parts of the paper (the work itself?) to argue that there’s a problem. Maybe we are ruling out rejecting cases where the science is sound and reasonably ethical, but the broader impacts statement is badly written?  

The FAQ also includes this:

How should I write the Broader Impact section? Answer: For additional motivation and general guidance, read Brent Hecht et al.’s white paper and blogpost, as well as this blogpost from the Centre for Governance of AI. For an example of such a discussion, see sec. 4 in this paper from Gillick et al. 

So I looked at some of the links, and the blogpost by Hecht gives some important info about how reviewers are supposed to read these things: 

As authors, you’re also likely reviewers (esp. this year!). NeurIPS leaders should probably address this directly, but our proposal’s view is that it’s not your job as a reviewer to judge submissions for their impacts. Rather, you should evaluate the *rigor with which they disclose their impacts*. Our proposal also recommends that reviewers adopt a “big tent” approach as “norms and standards develop”.

So reviewers should judge how rigorously authors reported impacts, but they can’t reject papers on the basis of them. The NeurIPS reviewer guidelines say a little more, basically echoing this point that the reviewer’s judgment is about whether the authors did an adequate job reflecting on positive and negative potential impacts. 

Right after this point, the reviewer guidelines mention the broader issue of ethical concerns as a possible reason for paper rejection:

Does the submission raise potential ethical concerns? This includes methods, applications, or data that create or reinforce unfair bias or that have a primary purpose of harm or injury. If so, please explain briefly.

Yes or No. Explain if the submission might raise any potential ethical concern. Note that your rating should be independent of this. If the AC also shares this concern, dedicated reviewers with expertise at the intersection of ethics and ML will further review the submission. Your duty here is to flag only papers that might need this additional revision step.

Something unsatisfying about this piece for me is that, as a reviewer, I’m told to assess the broader impacts statement only for how well it reports on positive and negative potential, but then I can also raise ethical concerns with the paper. In a case where I would not have noticed an ethical concern with the paper, but a convincing broader impacts statement brings one to my attention, I assume I can then flag the paper for an ethics concern. If the dedicated ethics reviewers described above then decided to reject the paper, I think it’s still fair to say that this would not be an example of a paper being rejected solely on the basis of the broader impacts statement. But the broader impacts statement could have effectively handed over the matches that started the fire.

This makes me wonder: if I were an author who felt I had some private information about possible ethical consequences that might not be available to my reviewers, based, e.g., on the amount of time I’ve spent thinking about my specific topic, would I be motivated to share it? It seems like it would be slightly safer to keep it to myself, since even if the reviewers are smarter than I think, withholding it shouldn’t make the outcome of my paper any worse.

NeurIPS is now over, so some authors may be looking for extra feedback about the broader impacts statements in the form of actual paper outcomes. NeurIPS program chairs published a Medium post with some stats. 

This year, we required authors to include a broader impact statement in their submission. We did not reject any papers on the grounds that they failed to meet this requirement. However, we will strictly require that this section be included in the camera-ready version of the accepted papers. As you can see from the histogram of the number of words in this section, about 9% of the submission did not have such a section, and most submissions had a section with about 100 words.

We appointed an ethics advisor and invited a pool of 22 ethics reviewers (listed here) with expertise in fields such as AI policy, fairness and transparency, and ethics and machine learning. Reviewers could flag papers for ethical concerns, such as submissions with undue risk of harm or methods that might increase unfair bias through improper use of data, etc. Papers that received strong technical reviews yet were flagged for ethical reasons were assessed by the pool of ethics reviewers.

Thirteen papers met these criteria and received ethics reviews. Only four papers were rejected because of ethical considerations, after a thorough assessment that included the original technical reviewers, the area chair, the senior area chair and also the program chairs. Seven papers flagged for ethical concerns were conditionally accepted, meaning that the final decision is pending the assessment of the area chair once the camera ready version is submitted. Some of these papers require a thorough revision of the broader impact section to include a clearer discussion of potential risks and mitigations, and others require changes to the submission such as the removal of problematic datasets. Overall, we believe that the ethics review was a successful and important addition to the review process. Though only a small fraction of papers received detailed ethical assessments, the issues they presented were important and complex and deserved the extended consideration. In addition, we were very happy with the high quality of the assessments offered by the ethics reviewers, and the area chairs and senior area chairs also appreciated the additional feedback.

This seems mostly consistent with what was said in the pre-conference descriptions, but doesn’t seem to rule out the concern that a good broader impacts statement could be the reason a reviewer flags a paper for ethics. This made me wonder more about how the advocates frame the value of this exercise for authors. 

So I looked for evidence of what intentions the NeurIPS leadership seemed to have in mind in introducing these changes. I even attended a few NeurIPS workshops that were related, in an effort to become a more informed computer scientist myself. Here’s what I understand the intentions to be:

  1. To give ML researchers practice reflecting on ethics implications of their work, so they hopefully make more socially responsible tech in the future.
  2. To encourage more “balanced” reporting on research, i.e., “remove the rose-colored glasses”, as described here.
  3. To help authors identify future research they could do to help address negative societal outcomes of their work. 
  4. To generate interesting data on how ML researchers react to an ethics prompt.

One thing I’ve heard relatively consistently from advocates is that this is a big experiment. This seems like an honest description, conveying that they recognize that requiring reflection on ethics is a big change to the expectations placed on CS researchers, and that since they don’t know exactly how best to produce this reflection, they’re experimenting.

But implying that it’s all a big open-ended experiment seems counterproductive to building trust among researchers in the organizers’ vision around ethics. Don’t get me wrong, I’m in favor of getting computer scientists to think more intentionally about possible downsides to what they build. That this needs to happen in some form seems inevitable given the visibility of algorithms in different decision-making applications. My observation here is just that there seem to be mixed messages about where the value in the experiment lies to someone who is honestly trying to figure it out. Is reflecting on possible ethical issues thought to be of intrinsic value itself? That seems like the primary intention from what organizers say. But the ambiguity in the review process, and other points I’ve seen argued in talks and panel discussions make me think that transparency around possible implications is thought to be valuable more as a means to an end, i.e., making it easier for others to weigh pros and cons so as to ultimately judge what tech should or shouldn’t be pursued. I think if I were a NeurIPS author asked to write one, the ambiguity about how the statements are meant to be used would make it hard for me to know what incentives I have to do a good job now, or what I should expect in the future as the exercise evolves. From an outside perspective looking in, I get the sense there isn’t yet a clear consensus about what this should be.

Maybe it’s naive and even cliche to be looking for a clear objective function. Or maybe it’s simply too early to expect that. As a computer scientist witnessing these events, though, it’s hard to accept that big changes are occurring but no one knows exactly where it’s all headed. Trusting in something that is portrayed as hard to formalize or intentionally unformalized doesn’t come easy. Although I can say that the lack of answers and confusing messaging is prompting me to read more of the ML ethics lit, so perhaps there’s a silver lining to making a bold move even if the details aren’t all worked out.

I also can’t help but think back to watching the popularization of the replication crisis in the early 2010s, from the initial controversy, to growing acceptance that there was a problem, to the sustained search for solutions that will actually help shift both training and incentives. I think we’re still very early in a similar process in computer science, but with many who are eager to put changes in motion.

As if we needed another example of lying with statistics and not issuing a correction: bike-share injuries

This post is by Phil Price

A Washington Post article says “In the first study of its kind, researchers from Washington State University and elsewhere found a 14 percent greater risk of head injuries to cyclists associated with cities that have bike share programs. In fact, when they compared raw head injury data for cyclists in five cities before and after they added bike share programs, the researchers found a 7.8 percent increase in the number of head injuries to cyclists.”

Actually that’s not even an example of “how to lie with statistics”; it’s simply an example of “how to lie”: as noted on StreetsBlog, data published in the study show that “In the cities that implemented bike-share…all injuries declined 28 percent, from 757 to 545. Head injuries declined 14 percent, from 319 to 273 per year. And moderate to severe head injuries also declined from 162 to 119. Meanwhile, in the control cities that do not have bike-share, all injuries increased slightly from 932 to 953 per year — 6 percent.” There’s a nice table on Streetsblog, taken from the study (make sure you read the caption).

So the number of head injuries declined by 14 percent, and the Washington Post reporter — Lenny Bernstein, for those of you keeping score at home — says they went up 7.8%. That’s a pretty big mistake! How did it happen? Well, the number of head injuries went down, but the number of injuries that were not head injuries went down even more, so the proportion of injuries that were head injuries went up.
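To make the arithmetic concrete, here is a minimal sketch in Python (my own illustration, using only the counts quoted above from StreetsBlog; the exact 7.8% and 14% figures presumably come from comparisons of proportions or odds against the control cities, and I can’t reconstruct them from these counts alone):

```python
# Injury counts quoted above from the StreetsBlog summary of the study:
# 319 of 757 injuries were head injuries before bike-share; 273 of 545 after.
before_head, before_all = 319, 757
after_head, after_all = 273, 545

print(f"Head injuries: {before_head} -> {after_head} "
      f"({(after_head - before_head) / before_head:+.0%})")               # roughly -14%
print(f"All injuries:  {before_all} -> {after_all} "
      f"({(after_all - before_all) / before_all:+.0%})")                  # roughly -28%
print("Share of injuries that were head injuries: "
      f"{before_head / before_all:.0%} -> {after_head / after_all:.0%}")  # ~42% -> ~50%
```

The head-injury count drops, the total drops faster, and the share of injuries that are head injuries climbs from roughly 42 percent to roughly 50 percent — which is how a decline can end up reported as an increase.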
According to StreetsBlog, University of British Columbia public health professor Kay Ann Teschke “attempted to notify Bernstein of the problem with the article in the comments of the story, and he was initially dismissive. He has since admitted in the comments that she is right, but had not adjusted his piece substantially at the time we published this post.” (I don’t see that exchange in the comments, although I do see that other commenters have pointed out the error).

To be fair to Bernstein, it looks like he may have gotten his bad information straight from the researchers who did the study: The University of Washington’s Health Sciences NewsBeat also says “Risk of head injury among cyclists increased 14 percent after implementation of bike-share programs in several major cities”. It’s hard to fault Bernstein for getting the story wrong if he was just repeating errors that were in a press release approved by one of the study’s authors!

But how do Bernstein, the Washington Post, the study’s author at the University of Washington (Janessa Graves), and the University of Washington itself justify their failure to correct this misinformation? It’s a major error, and it’s not that hard to edit a web page to insert a correction or retraction.

[Note added June 18: When I posted this I also emailed Bernstein and the UW Health Sciences Newsbeat to give them a heads-up and invite comment. Newsbeat has changed the story to make it clear that the proportion of injuries that are head injuries increased in the bike share cities. They do not note that the number of head injuries decreased, and it looks like they forgot to correct the headline so it’s still wrong. At least they acknowledged the problem and did something, although I daresay most readers of that page will still be misled. But it’s no longer flat wrong. Except the headline.]

Of course, even simply retracting the story would be a missed opportunity: the real story here is that injuries went down in bike-share cities in spite of the fact that more people were riding. That’s a surprise! As a bike commuter, I know that it has long been argued that biking becomes safer per biker-mile as more people ride, because drivers become more alert to the likely presence of bikes. But I would not have expected the decrease in risk per mile to more than counteract the increase in miles ridden, such that the number of injuries would go down. Or, of course, maybe that’s not what happened; maybe there were other changes, coincident with the introduction of bike-share programs, that decreased risk in the bike-share cities but not the control cities.

This sort of thing — by which I mean mis-reporting of scientific results in general — is just so, so frustrating and demoralizing to me. If people think bike share programs substantially increase the risk of injury, that belief has consequences. It affects the amount of public support for such programs (and for biking in general) as well as affecting individuals’ decisions about whether or not to use those programs themselves. To see these stories get twisted around, and to see journalists refuse to correct them…grrrrr.

This post is by Phil Price
[Andrew, please add “Ethics” and “Journalism” categories to this blog]