Is the following scenario a good example of the sort of problem you have in mind? Say you have two advanced ML systems with values that are partially, but not entirely, aligned with humanity: their utility function is 0.9 * (human values) + 0.1 * (control of resources). These two ML systems have been trained with advanced RL, in such a fashion that, when interacting with other powerful systems, they learn to play Nash equilibria. The only Nash equilibrium of their interaction is one where they ruthlessly compete for resources, making the Earth uninhabitable in the process. So both systems are “pretty much aligned”, but their joint interaction is radically unaligned. If this seems like a reasonable example, two thoughts:
A) I think other people in this discussion might be envisioning ‘aligned AI’ as looking more like an approval-directed agent, rather than a system trained with RL on a proxy for the human utility function. Crucially, in this paradigm the system’s long-term planning and bargaining are emergent consequences of what it predicts an (amplified) human would evaluate highly; they’re not baked into the RL algorithm itself. This means it would only try to play a Nash equilibrium if it thinks humans would value that highly, which, in this scenario, they would not. In approval-directed AI systems, or more generally systems where strategic behavior is an emergent consequence of some other algorithm, bargaining ability should rise in tandem with general capability, making it unlikely that very powerful systems would have ‘obvious’ bargaining failures.
B) It seems that systems that are bad at bargaining would also be worse at acquiring resources. For instance, maybe the Nash equilibrium of the above interaction of two RL agents would actually be more like ‘try to coordinate a military strike against the other AI as soon as possible’, leading either to both systems being crippled or to a unipolar scenario (which would be OK given the systems’ mostly-aligned utility functions). The scenarios in the post seem to envision systems with some ability to bargain with others, but only for certain parts of their utility function, maybe those that are simple to measure. I think it might be worth emphasizing that more, or describing what kind of RL algorithms would give rise to bargaining abilities that look like that.
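To make the scenario I'm asking about concrete, here is a toy version as a two-action game. This is only a sketch: the 0.9 / 0.1 weights come from the utility function above, but the action names, the human-value numbers, and the resource payoffs are all made-up assumptions, chosen so that the interaction has the prisoner's-dilemma structure I described.

```python
from itertools import product

ACTIONS = ("cooperate", "compete")

# Hypothetical numbers (not from the post): how well human values fare,
# keyed by how many of the two systems choose to compete.
HUMAN_VALUE = {0: 1.0, 1: 0.5, 2: 0.0}


def resources(own, other):
    """Hypothetical resource payoff, in arbitrary units."""
    if own == other:
        return 5.0  # same action on both sides: resources are split evenly
    return 10.0 if own == "compete" else 0.0  # the competitor takes everything


def utility(own, other):
    """0.9 * (human values) + 0.1 * (control of resources)."""
    n_competing = [own, other].count("compete")
    return 0.9 * HUMAN_VALUE[n_competing] + 0.1 * resources(own, other)


def pure_nash_equilibria():
    """Return every action pair where neither system gains by deviating alone."""
    equilibria = []
    for a, b in product(ACTIONS, repeat=2):
        a_is_best = all(utility(a, b) >= utility(alt, b) for alt in ACTIONS)
        b_is_best = all(utility(b, a) >= utility(alt, a) for alt in ACTIONS)
        if a_is_best and b_is_best:
            equilibria.append((a, b))
    return equilibria


print(pure_nash_equilibria())
print(utility("cooperate", "cooperate"), utility("compete", "compete"))
```

With these particular numbers, competing is strictly better for each system whatever the other does, so mutual competition is the only equilibrium even though both systems would score higher under mutual cooperation. Different made-up numbers can easily remove the dilemma, so the sketch shows that the scenario is possible, not that it is likely.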