Flo’s summary for the Alignment Newsletter:
This agenda by the Effective Altruism Foundation focuses on risks of astronomical suffering (s-risks) posed by <@transformative AI@>(@Defining and Unpacking Transformative AI@) (TAI), especially those related to conflicts between powerful AI agents. The focus on conflict is motivated by the particularly clear path from extortion and executed threats against agents with altruistic values to s-risks. While especially important in the context of s-risks, cooperation between AI systems is also relevant from a range of other viewpoints. The agenda covers four clusters of topics: strategy and governance, credibility and bargaining, current AI frameworks, and decision theory.
The extent of cooperation failures is likely influenced by how power is distributed after the transition to TAI. At first glance, it seems like scenarios with widely distributed power (such as <@CAIS@>(@Reframing Superintelligence: Comprehensive AI Services as General Intelligence@)) are more problematic, but related literature from international relations paints a more complicated picture. The agenda seeks a better understanding of how the distribution of power affects catastrophic risk, as well as potential levers to influence this distribution. Other topics in the strategy/governance cluster include the identification and analysis of realistic scenarios for misalignment, as well as case studies on cooperation failures in humans and how they can be affected by policy.
TAI might enable unprecedented credibility, for example by being highly transparent; credibility is crucial for both contracts and threats. The agenda aims to develop better models of the effects of credibility on cooperation failures. One approach to this is open-source game theory, where agents can inspect each other’s source code. Promising approaches to preventing catastrophic cooperation failures include the identification of peaceful bargaining mechanisms, as well as surrogate goals. The idea of surrogate goals is for an agent to commit to act, whenever it is threatened, as if it had a different goal, in order to protect its actual goal from threats.
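To make the open-source setting concrete, here is a minimal, hypothetical sketch (not an example from the agenda; the payoff values and function names are illustrative assumptions): in a one-shot prisoner’s dilemma where each program receives the other’s source code, a program that cooperates exactly when the opponent’s source matches its own achieves mutual cooperation in self-play while remaining unexploitable by unconditional defectors.

```python
import inspect

# Toy one-shot prisoner's dilemma payoffs for the row player (illustrative values).
PAYOFFS = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def clique_bot(opponent_source: str) -> str:
    """Cooperate iff the opponent's source code is identical to mine."""
    my_source = inspect.getsource(clique_bot)
    return "C" if opponent_source == my_source else "D"

def defect_bot(opponent_source: str) -> str:
    """Always defect, regardless of the opponent's code."""
    return "D"

def play(p1, p2):
    """One round of the open-source game: each player sees the other's code."""
    a1 = p1(inspect.getsource(p2))
    a2 = p2(inspect.getsource(p1))
    return PAYOFFS[(a1, a2)], PAYOFFS[(a2, a1)]

print(play(clique_bot, clique_bot))  # (3, 3): mutual cooperation in self-play
print(play(clique_bot, defect_bot))  # (1, 1): the conditional cooperator is not exploited
```

Richer constructions in the open-source game theory literature let non-identical programs cooperate by reasoning about each other’s code, but even this toy version shows how source-code transparency changes the strategic picture.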
As some aspects of contemporary AI architectures might still be present in TAI, it can be useful to study cooperation failures in current systems. One concrete approach to enabling cooperation in social dilemmas that could be tested with contemporary systems is based on bargaining over policies combined with punishments for deviations. Relatedly, it is worth investigating whether or not multi-agent training leads to human-like bargaining by default. This has implications for the suitability of behavioural versus classical game theory for studying TAI. The behavioural game theory of human-machine interactions might also be important, especially in human-in-the-loop scenarios of TAI.
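The following is a minimal, hypothetical sketch of the “punish deviations” half of that idea (not the agenda’s proposal; the payoffs, class name, and punishment length are illustrative assumptions): an agent in an iterated prisoner’s dilemma follows an agreed policy of mutual cooperation and responds to any observed deviation with a fixed number of rounds of defection, which makes a one-off deviation unprofitable for the deviator.

```python
# Toy iterated prisoner's dilemma: the agent has "agreed" to cooperate and
# punishes any observed deviation with `punish_rounds` rounds of defection.
PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
           ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

class BargainingAgent:
    def __init__(self, punish_rounds: int = 3):
        self.punish_rounds = punish_rounds  # length of the punishment phase
        self.punishing = 0                  # punishment rounds remaining

    def act(self) -> str:
        # Follow the agreed policy (cooperate) unless currently punishing.
        return "D" if self.punishing > 0 else "C"

    def observe(self, opponent_action: str) -> None:
        if self.punishing > 0:
            self.punishing -= 1
        elif opponent_action == "D":        # deviation from the agreement
            self.punishing = self.punish_rounds

def run(agent, opponent_defects=frozenset(), rounds=10):
    """Play against a scripted opponent that cooperates except on the listed rounds."""
    total_agent = total_opponent = 0
    for t in range(rounds):
        a = agent.act()
        b = "D" if t in opponent_defects else "C"
        pa, pb = PAYOFFS[(a, b)]
        total_agent += pa
        total_opponent += pb
        agent.observe(b)
    return total_agent, total_opponent

print(run(BargainingAgent()))       # (30, 30): the agreement is kept
print(run(BargainingAgent(), {2}))  # a single deviation leaves the deviator worse off than full cooperation
```

The sketch covers only enforcement; the approach described above would additionally have the agents bargain over which joint policy to enforce in the first place.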
The last cluster discusses the implications of bounded computation for decision theory, as well as the decision theories (implicitly) used by current agent architectures. Another focus is acausal reasoning, and in particular the possibility of acausal trade, where correlated AI systems cooperate without any causal links between them.
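As a toy illustration of the reasoning behind acausal cooperation (a minimal, hypothetical example rather than anything from the agenda; payoffs are arbitrary): in a “twin” prisoner’s dilemma, an agent that treats its copy’s action as independent of its own prefers defection, while an agent that conditions on the fact that an exact copy will choose whatever it chooses prefers cooperation, even though there is no causal link between the two decisions.

```python
# Twin prisoner's dilemma: my payoff as a function of (my action, twin's action).
# Illustrative payoff values only.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def expected_payoff(my_action, p_twin_cooperates):
    """Expected payoff if the twin cooperates with the given probability."""
    return (p_twin_cooperates * PAYOFF[(my_action, "C")]
            + (1 - p_twin_cooperates) * PAYOFF[(my_action, "D")])

# Treating the twin's action as independent of mine (say 50/50), defection wins:
print(expected_payoff("D", 0.5), expected_payoff("C", 0.5))  # 3.0 vs 1.5

# Conditioning on the correlation (an exact copy chooses what I choose),
# cooperation wins:
print(expected_payoff("C", 1.0), expected_payoff("D", 0.0))  # 3 vs 1
```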
Flo’s opinion:
I am broadly sympathetic to the focus on preventing the worst outcomes and it seems plausible that extortion could play an important role in these, even though I worry more about distributional shift plus incorrigibility. Still, I am excited about the focus on cooperation, as this seems robustly useful for a wide range of scenarios and most value systems.
My opinion:
Under a suffering-focused ethics in which s-risks far overwhelm x-risks, I think it makes sense to focus on this agenda. There don’t seem to be many plausible paths to s-risks: by default, we shouldn’t expect them, because it would be quite surprising for an amoral AI system to think it was particularly useful or good for humans to _suffer_, as opposed to not exist at all, and there doesn’t seem to be much reason to expect an immoral AI system. Conflict and the possibility of carrying out threats are the most plausible ways by which I could see this happening, and the agenda here focuses on neglected problems in this space.
However, under other ethical systems (under which s-risks are worse than x-risks, but do not completely dwarf x-risks), I expect other safety research to be more impactful, because the failure mode of an amoral AI system that doesn’t care about you seems both more likely and more amenable to technical safety approaches (to me at least).
There don’t seem to be many plausible paths to s-risks: by default, we shouldn’t expect them, because it would be quite surprising for an amoral AI system to think it was particularly useful or good for humans to _suffer_, as opposed to not exist at all, and there doesn’t seem to be much reason to expect an immoral AI system.
I think this is probably false, but it’s because I’m using the strict definition of s-risk.
I expect that, to the extent there’s any human-like or animal-like stuff in the future, the sheer amount of additional computation available implies that even proportionally small amounts of suffering would add up to a greater aggregate of suffering than currently exists on Earth.
If 0.01% of an intergalactic civilization’s resources were being used to host suffering programs, such as nature simulations or extremely realistic video games, then this would certainly qualify as an s-risk under the definition given here: “S-risks are events that would bring about suffering on an astronomical scale, vastly exceeding all suffering that has existed on Earth so far.”
If you define s-risks as situations where proportionally large amounts of computation are focused on creating suffering, then I would agree with you. However, s-risks could still maybe be important because they could be unusually tractable. One reason might be that even just a very small group of people who strongly don’t want suffering to exist could successfully lobby against society’s weak preference for having proportionally small amounts of suffering. Suffering might be unique among values in this respect, because on other issues people might want to fight you more.

Yeah, I should have said something like 'the biggest kinds of s-risks where there is widespread optimization for suffering'.
the failure mode of an amoral AI system that doesn’t care about you seems both more likely and more amenable to technical safety approaches (to me at least).
It seems to me that at least some parts of this research agenda are relevant for some special cases of “the failure mode of an amoral AI system that doesn’t care about you”. A lot of contemporary AIS research assumes some kind of human-in-the-loop setup (e.g. amplification/debate, recursive reward modeling) and for such setups it seems relevant to consider questions like “under what circumstances do humans interacting with an artificial agent become convinced that the agent’s commitments are credible?”. Such questions seem relevant under a very wide range of moral systems (including ones that don’t place much weight on s-risks).
It seems to me that at least some parts of this research agenda are relevant for some special cases of “the failure mode of an amoral AI system that doesn’t care about you”.
I still wouldn’t recommend working on those parts, because they seem decidedly less impactful than other options. But as written it does sound like I’m claiming that the agenda is totally useless for anything besides s-risks, which I certainly don’t believe. I’ve changed that second paragraph to:
However, under other ethical systems (under which s-risks are worse than x-risks, but do not completely dwarf x-risks), I expect other technical safety research to be more impactful, because other approaches can more directly target the failure mode of an amoral AI system that doesn’t care about you, which seems both more likely and more amenable to technical safety approaches (to me at least). I could imagine work on this agenda being quite important for _strategy_ research, though I am far from an expert here.