This long post explains the author’s beliefs about a variety of research topics relevant to AI existential safety. First, let’s look at some definitions.
While AI safety alone just means getting AI systems to avoid risks (including e.g. the risk of a self-driving car crashing), _AI existential safety_ means preventing AI systems from posing risks at least as bad as human extinction. _AI alignment_ on the other hand is about getting an AI system to try to / succeed at doing what a person or institution wants them to do. (The “try” version is _intent alignment_, while the “succeed” version is _impact alignment_.)
Note that AI alignment is not the same thing as AI existential safety. In addition, the author makes the stronger claim that it is insufficient to guarantee AI existential safety, because AI alignment tends to focus on situations involving a single human and a single AI system, whereas AI existential safety requires navigating systems involving multiple humans and multiple AI systems. Just as AI alignment researchers worry that work on AI capabilities for useful systems doesn’t engage enough with the difficulty of alignment, the author worries that work on alignment doesn’t engage enough with the difficulty of multiagent systems.
The author also defines _AI ethics_ as the principles that AI developers and systems should follow, and _AI governance_ as identifying _and enforcing_ norms for AI developers and systems to follow. While ethics research may focus on resolving disagreements, governance will be more focused on finding agreeable principles and putting them into practice.
Let’s now turn to how to achieve AI existential safety. The main mechanism the author sees is to _anticipate, legitimize, and fulfill governance demands_ for AI technology. Roughly, governance demands are those properties which there are social and political pressures for, such as “AI systems should be fair” or “AI systems should not lead to human extinction”. If we can _anticipate_ these demands in advance, then we can do technical work on how to _fulfill_ or meet these demands, which in turn _legitimizes_ them, that is, it makes it clearer that the demand can be fulfilled and so makes it easier to create common knowledge that it is likely to become a legal or professional standard.
We then turn to various different fields of research, which the author ranks on three axes: helpfulness to AI existential safety (including potential negative effects), educational value, and neglectedness. Note that for educational value, the author is estimating the benefits of conducting research on the topic _to the researcher_, and not to (say) the rest of the field. I’ll only focus on helpfulness to AI existential safety below, since that’s what I’m most interested in (it’s where the most disagreement is, and so where new arguments are most useful), but I do think all three axes are important.
The author ranks both preference learning and out of distribution robustness lowest on helpfulness to existential safety (1/10), primarily because companies already have a strong incentive to have robust AI systems that understand preferences.
Multiagent reinforcement learning (MARL) comes only slightly higher (2/10), because since it doesn’t involve humans its main purpose seems to be to deploy fleets of agents that may pose risks to humanity. It is possible that MARL research could help by producing <@cooperative agents@>(@Cooperative AI Workshop@), but even this carries its own risks.
Agent foundations is especially dual-use in this framing, because it can help us understand the big multiagent system of interactions, and there isn’t a restriction on how that understanding could be used. It consequently gets a low score (3/10), that is a combination of “targeted applications could be very useful” and “it could lead to powerful harmful forces”.
Minimizing side effects starts to address the challenges the author sees as important (4/10): in particular, it can allow us both to prevent accidents, where an AI system “messes up”, and it can help us prevent externalities (harms to people other than the primary stakeholders), which are one of the most challenging issues in regulating multiagent systems.
Fairness is valuable for the obvious reason: it is a particular governance demand that we have anticipated, and research on it now will help fulfill and legitimize that demand. In addition, research on fairness helps get people to think at a societal scale, and to think about the context in which AI systems are deployed. It may also help prevent centralization of power from deployment of AI systems, since that would be an unfair outcome.
The author would love it if AI/ML pivoted to frequently think about real-life humans and their desires, values and vulnerabilities. Human-robot interaction (HRI) is a great way to cause more of that to happen, and that alone is valuable enough that the author assigns it 6⁄10, tying it with fairness.
As we deploy more and more powerful AI systems, things will eventually happen too quickly for humans to monitor. As a result, we will need to also automate the process of governance itself. The area of computational social choice is well-posed to make this happen (7/10), though certainly current proposals are insufficient and more research is needed.
Accountability in ML is good (8/10) primarily because as we make ML systems accountable, we will likely also start to make tech companies accountable, which seems important for governance. In addition, in a <@CAIS@>(@Reframing Superintelligence: Comprehensive AI Services as General Intelligence@) scenario, better accountability mechanisms seem likely to help in ensuring that the various AI systems remain accountable, and thus safer, to human society.
Finally, interpretability is useful (8/10) for the obvious reasons: it allows developers to more accurately judge the properties of systems they build, and helps in holding developers and systems accountable. But the most important reason may be that interpretable systems can make it significantly easier for competing institutions and nations to establish cooperation around AI-heavy operations.
There is then an additional question of whether the best thing to do to improve AI existential safety is to work on technical approaches to governance challenges. There’s some pushback on this claim in the comments that I agree with; I recommend reading through it. It seems like the core disagreement is on the relative importance of risks: in particular, it sounds like the author thinks that existing incentives for preference learning and out-of-distribution robustness are strong enough that we mostly don’t have to worry about it, whereas governance will be much more challenging; I disagree with at least that relative ranking.
It’s possible that we agree on the strength of existing incentives—I’ve <@claimed@>(@Conversation with Rohin Shah@) a risk of 10% for existentially bad failures of intent alignment if there is no longtermist intervention, primarily because of existing strong incentives. That could be consistent with this post, in which case we’d disagree primarily on whether the “default” governance solutions are sufficient for handling AI risk, where I’m a lot more optimistic than the author.
Planned summary for the Alignment Newsletter:
Planned opinion: