Humans have skills and motivations (such as deception, manipulation and power-hungriness) which would be dangerous in AGIs. It seems plausible that the development of many of these traits was driven by competition with other humans, and that AGIs trained to answer questions or do other limited-scope tasks would be safer and less goal-directed. I briefly make this argument here.
Note that he claims that this may be true even if single/single alignment is solved, and all AGIs involved are aligned to their respective users.
It strikes me as interesting that much of the existing work that’s been done on multiagent training, such as it is, focusses on just examining the behaviour of artificial agents in social dilemmas. The thinking seems to be—and this was also suggested in ARCHES—that it’s useful just for exploratory purposes to try to characterise how and whether RL agents cooperate in social dilemmas, what mechanism designs and what agent designs promote what types of cooperation, and if there are any general trends in terms of what kinds of multiagent failures RL tends to fall into.
For example, it’s generally known that regular RL tends to fail to cooperate in social dilemmas (‘Unfortunately, selfish MARL agents typically fail when faced with social dilemmas’); a toy sketch of this dynamic is included after the quoted passage below. From ARCHES:

One approach to this research area is to continually examine social dilemmas through the lens of whatever is the leading AI development paradigm in a given year or decade, and attempt to classify interesting behaviors as they emerge. This approach might be viewed as analogous to developing “transparency for multi-agent systems”: first develop interesting multi-agent systems, and then try to understand them.

There seems to be an implicit assumption here that something very important and unique to multiagent situations would be uncovered, by analogy to things like the flash crash. It’s not clear to me that we’ve examined the intersection of RL and social dilemmas enough to notice this if it were true, and I think that’s the major justification for working on this area.
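To make that failure mode concrete, here is a minimal sketch (my own illustration, not code from ARCHES or from the paper quoted above): two independent Q-learners repeatedly play a prisoner’s dilemma, and because defection dominates for each selfish learner, both typically converge to mutual defection even though mutual cooperation would pay each of them more. The payoff matrix and hyperparameters are arbitrary illustrative choices.

```python
# Minimal sketch: two independent, "selfish" Q-learners in a repeated
# prisoner's dilemma. Each agent updates only on its own reward, so both
# typically converge to mutual defection despite the Pareto-better (C, C).
# Payoffs and hyperparameters are illustrative assumptions.

import random

# PAYOFFS[(a0, a1)] = (reward_0, reward_1); 0 = cooperate, 1 = defect.
PAYOFFS = {
    (0, 0): (3, 3),
    (0, 1): (0, 5),
    (1, 0): (5, 0),
    (1, 1): (1, 1),
}

EPISODES = 20_000
ALPHA = 0.1    # learning rate
EPSILON = 0.1  # exploration rate

# Stateless Q-values: one value per action, per agent.
q_values = [[0.0, 0.0], [0.0, 0.0]]

def choose(q):
    """Epsilon-greedy action selection over one agent's Q-values."""
    if random.random() < EPSILON:
        return random.randrange(2)
    return max(range(2), key=lambda a: q[a])

for _ in range(EPISODES):
    actions = (choose(q_values[0]), choose(q_values[1]))
    rewards = PAYOFFS[actions]
    # Each agent updates toward its own reward only -- the "selfish" part.
    for i in range(2):
        a = actions[i]
        q_values[i][a] += ALPHA * (rewards[i] - q_values[i][a])

print("Agent 0 Q-values (C, D):", q_values[0])
print("Agent 1 Q-values (C, D):", q_values[1])
# Typically both agents end up valuing D (defect) above C (cooperate).
```

Richer settings (sequential social dilemmas, mechanism-design interventions, opponent-shaping learners) can change this outcome, which is exactly the kind of variation the exploratory work described above is trying to characterise.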
Strongly agree that it’s unclear that these failures would be detected.
For discussion and examples, see my paper here: https://www.mdpi.com/2504-2289/3/2/21/htm