This year, NeurIPS, a top machine learning conference, required a broader impacts statement from authors. From the call:
In order to provide a balanced perspective, authors are required to include a statement of the potential broader impact of their work, including its ethical aspects and future societal consequences. Authors should take care to discuss both positive and negative outcomes
I heard that next year ICML, another top ML conference, will add the same requirement.
Questions about how to make computer scientists more thoughtful about the potential societal implications of what they create have been around for a while, with an increasing number of researchers looking at how to foster reflection or transparency through different design methods, fact sheets to go along with tech contributions, etc. But a requirement that all authors try to address broader societal implications in publications is a new thing. Actions like these are part of a reform movement aimed at shifting computer science values rather drastically away from the conventional view that algorithms and math are outside of any moral philosophy.
Here, I’m not going to take on the bigger questions of what this kind of action means or how useful it is; instead I’ll reflect on how it was done, and on the questions I, as a curious outsider (I don’t publish at NeurIPS), have had in looking at the official messaging about the broader impacts statement. It’s felt a bit like doing a puzzle.
While the call doesn’t go into too much detail about how the statements should be written or used, the FAQ for authors says:
Do I have to complete the Broader Impact section? Answer: Yes, please include the section in the submission for review. However, if your work is very theoretical or is general enough that there is no particular application foreseen, then you are free to write that a Broader Impact discussion is not applicable.
So, at a minimum, some acknowledgment of the requirement is required. How is it used in the reviewing process?
Can my submission be rejected solely on the basis of the Broader Impact Section? Answer: No. Reviewers will be asked to rate a submission based on the evaluation criteria. They will also be asked to check whether the Broader Impact is adequately addressed. In general, if a paper presents theoretical work without any foreseeable impact in the society, authors can simply state “This work does not present any foreseeable societal consequence”. If a paper presents a method or an application that might have reasonable chances to have some broader impact, authors can discuss along the following lines: “This work has the following potential positive impact in the society…. At the same time, this work may have some negative consequences because… Furthermore, we should be cautious of the result of failure of the system which could cause…”
I checked out the evaluation criteria which are also part of the call, which include this sentence about broader impacts:
Regardless of scientific quality or contribution, a submission may be rejected for ethical considerations, including methods, applications, or data that create or reinforce unfair bias or that have a primary purpose of harm or injury.
It’s a little ambiguous, but since they say above that a submission cannot be rejected solely on the basis of the Broader Impact section, I assume the reviewer would have to point to other parts of the paper (the work itself?) to argue that there’s a problem. Maybe we are ruling out rejecting cases where the science is sound and reasonably ethical, but the broader impacts statement is badly written?
The FAQ also includes this:
How should I write the Broader Impact section? Answer: For additional motivation and general guidance, read Brent Hecht et al.’s white paper and blogpost, as well as this blogpost from the Centre for Governance of AI. For an example of such a discussion, see sec. 4 in this paper from Gillick et al.
So I looked at some of the links, and the blogpost by Hecht gives some important info about how reviewers are supposed to read these things:
As authors, you’re also likely reviewers (esp. this year!). NeurIPS leaders should probably address this directly, but our proposal’s view is that it’s not your job as a reviewer to judge submissions for their impacts. Rather, you should evaluate the *rigor with which they disclose their impacts*. Our proposal also recommends that reviewers adopt a “big tent” approach as “norms and standards develop”.
So reviewers should judge how rigorously authors reported impacts, but they can’t reject papers on the basis of them. The NeurIPS reviewer guidelines say a little more, basically echoing this point that the reviewer’s judgment is about whether the authors did an adequate job reflecting on positive and negative potential impacts.
Right after this point, the reviewer guidelines mention the broader issue of ethical concerns as a possible reason for paper rejection:
Does the submission raise potential ethical concerns? This includes methods, applications, or data that create or reinforce unfair bias or that have a primary purpose of harm or injury. If so, please explain briefly.
Yes or No. Explain if the submission might raise any potential ethical concern. Note that your rating should be independent of this. If the AC also shares this concern, dedicated reviewers with expertise at the intersection of ethics and ML will further review the submission. Your duty here is to flag only papers that might need this additional revision step.
Something unsatisfying about this for me is that as a reviewer, I’m told to assess the broader impacts statement only for how well it reports on positive and negative potential, but I can then raise ethical concerns with the paper. In a case where I would not have noticed an ethical concern with the paper, but a convincing broader impacts statement brings one to my attention, I assume I can then flag the paper for an ethics concern. If the dedicated ethics reviewers described above then decided to reject the paper, I think it’s still fair to say that this would not be an example of the paper being rejected solely on the basis of the broader impacts statement. But the statement could have effectively handed over the matches that started the fire.
This makes me wonder: if I were an author who felt I had some private information about possible ethical consequences that might not be available to my reviewers, based, e.g., on the amount of time I’ve spent thinking about my specific topic, would I be motivated to share it? It seems like it would be slightly safer to keep it to myself, since even if the reviewers are smarter than I think, withholding it shouldn’t make the outcome of my paper any worse.
NeurIPS is now over, so some authors may be looking for extra feedback about the broader impacts statements in the form of actual paper outcomes. NeurIPS program chairs published a Medium post with some stats.
This year, we required authors to include a broader impact statement in their submission. We did not reject any papers on the grounds that they failed to meet this requirement. However, we will strictly require that this section be included in the camera-ready version of the accepted papers. As you can see from the histogram of the number of words in this section, about 9% of the submission did not have such a section, and most submissions had a section with about 100 words.
We appointed an ethics advisor and invited a pool of 22 ethics reviewers (listed here) with expertise in fields such as AI policy, fairness and transparency, and ethics and machine learning. Reviewers could flag papers for ethical concerns, such as submissions with undue risk of harm or methods that might increase unfair bias through improper use of data, etc. Papers that received strong technical reviews yet were flagged for ethical reasons were assessed by the pool of ethics reviewers.
Thirteen papers met these criteria and received ethics reviews. Only four papers were rejected because of ethical considerations, after a thorough assessment that included the original technical reviewers, the area chair, the senior area chair and also the program chairs. Seven papers flagged for ethical concerns were conditionally accepted, meaning that the final decision is pending the assessment of the area chair once the camera ready version is submitted. Some of these papers require a thorough revision of the broader impact section to include a clearer discussion of potential risks and mitigations, and others require changes to the submission such as the removal of problematic datasets. Overall, we believe that the ethics review was a successful and important addition to the review process. Though only a small fraction of papers received detailed ethical assessments, the issues they presented were important and complex and deserved the extended consideration. In addition, we were very happy with the high quality of the assessments offered by the ethics reviewers, and the area chairs and senior area chairs also appreciated the additional feedback.
This seems mostly consistent with what was said in the pre-conference descriptions, but doesn’t seem to rule out the concern that a good broader impacts statement could be the reason a reviewer flags a paper for ethics. This made me wonder more about how the advocates frame the value of this exercise for authors.
So I looked for evidence of what intentions the NeurIPS leadership seemed to have in mind in introducing these changes. I even attended a few NeurIPS workshops that were related, in an effort to become a more informed computer scientist myself. Here’s what I understand the intentions to be:
- To give ML researchers practice reflecting on ethics implications of their work, so they hopefully make more socially responsible tech in the future.
- To encourage more “balanced” reporting on research, i.e., “remove the rose-colored glasses”, as described here.
- To help authors identify future research they could do to help address negative societal outcomes of their work.
- To generate interesting data on how ML researchers react to an ethics prompt.
One thing I’ve heard relatively consistently from advocates is that this is a big experiment. This seems like an honest description, conveying that they recognize that requiring reflection on ethics is a big change to the expectations placed on CS researchers, and that since they don’t know exactly how best to produce this reflection, they’re experimenting.
But implying that it’s all a big open-ended experiment seems counterproductive to building trust among researchers in the organizers’ vision around ethics. Don’t get me wrong, I’m in favor of getting computer scientists to think more intentionally about possible downsides to what they build. That this needs to happen in some form seems inevitable given the visibility of algorithms in different decision-making applications.

My observation here is just that there seem to be mixed messages about where the value in the experiment lies to someone who is honestly trying to figure it out. Is reflecting on possible ethical issues thought to be of intrinsic value itself? That seems like the primary intention from what organizers say. But the ambiguity in the review process, and other points I’ve seen argued in talks and panel discussions, make me think that transparency around possible implications is thought to be valuable more as a means to an end, i.e., making it easier for others to weigh pros and cons so as to ultimately judge what tech should or shouldn’t be pursued. I think if I were a NeurIPS author asked to write one of these statements, the ambiguity about how they are meant to be used would make it hard for me to know what incentives I have to do a good job now, or what to expect in the future as the exercise evolves. From the outside looking in, I get the sense there isn’t yet a clear consensus about what this should be.
Maybe it’s naive, or even clichéd, to be looking for a clear objective function. Or maybe it’s simply too early to expect one. As a computer scientist witnessing these events, though, it’s hard to accept that big changes are occurring but no one knows exactly where it’s all headed. Trusting in something that is portrayed as hard to formalize, or intentionally unformalized, doesn’t come easily. I can say, though, that the lack of answers and confusing messaging is prompting me to read more of the ML ethics literature, so perhaps there’s a silver lining to making a bold move even if the details aren’t all worked out.
I also can’t help but think back to watching the popularization of the replication crisis in the early 2010s, from the initial controversy, to growing acceptance that there was a problem, to the sustained search for solutions that will actually help shift both training and incentives. I think we’re still very early in a similar process in computer science, but with many who are eager to put changes in motion.