Much of safety research focuses on a single agent that is directly incentivized by a loss/reward function to take particular actions. This sequence instead considers safety in the case of multi-agent systems interacting in complex environments. In this situation, even simple reward functions can yield complex and highly intelligent behaviors that are only indirectly related. For example, evolution led to humans who can learn to play chess, despite the fact that the ancestral environment did not contain chess games. In these situations, the problem is not how to construct an aligned reward function, the problem is how to shape the experience that the agent gets at training time such that the final agent policy optimizes for the goals that we want. This sequence lays out some considerations and research directions for safety in such situations.
One approach is to teach agents the generalizable skill of obedience. To accomplish this, one could design the environment to incentivize specialization. For instance, if an agent A is more powerful than agent B, but can see less of the environment than B, A might be incentivized to obey B’s instructions if they share a goal. Similarly we can increase the ease and value of coordination through enabling access to a shared permanent record or designing tasks that require large-scale coordination.
A second approach is to move agents to simpler and safer training regimes as they develop more intelligence. The key assumption here is that we may require complex regimes such as competitive multi-agent environments to jumpstart intelligent behavior, but may be able to continue training in a simpler regime such as single-task RL later. This is similar to current approaches for training a language model via supervised learning and then finetuning with RL, but going in the opposite direction to increase safety rather than capabilities.
A third approach is specific to a collective AGI: an AGI that is composed of a number of separate general agents trained on different objectives that learn to cooperatively solve harder tasks. This is similar to how human civilization is able to accomplish much more than any individual human. In this regime, the AGI can be effectively sandboxed by either reducing the population size or by limiting communication channels between the agents. One advantage of this approach to sandboxing is that it allows us to change the effective intelligence of the system at test-time, without going through a potentially expensive retraining phase.
Nicholas’s opinion:
I agree that we should put more emphasis on the safety of multi-agent systems. We already have <@evidence@>(@Emergent Tool Use from Multi-Agent Interaction@) that complex behavior can arise from simple objectives in current systems, and this seems only more likely as systems become more powerful. Two-agent paradigms such as GANs, self-play, and debate, are already quite common in ML. Lastly, humans evolved complex behavior from the simple process of evolution so we have at least one example of this working. I also think this is an interesting area where there is lots to learn from other fields, such as game theory and evolutionary biology,
For any empirically-minded readers of this newsletter, I think this sequence opens up a lot of potential for research. The development of safety benchmarks for multi-agent systems and then the evaluation of these approaches seems like it would make many of the considerations discussed here more concrete. I personally would find them much more convincing with empirical evidence to back up that they work with current ML.
My opinion:
The AGI model here in which powerful AI systems arise through multiagent interaction is an important and plausible one, and I’m excited to see some initial thoughts about it. I don’t particularly expect any of these ideas to be substantially useful, but I’m also not confident that they won’t be useful, and given the huge amount of uncertainty about how multiagent interaction shapes agents, that may be the best we can hope for currently. I’d be excited to see empirical results testing some of these ideas out, as well as more conceptual posts suggesting more ideas to try.
Nicholas’s summary for the Alignment Newsletter:
Nicholas’s opinion:
My opinion: