Here is the summary from ML Safety Newsletter #11, "Top Safety Papers of 2023."
Instrumental Convergence? questions the argument that a rational agent, regardless of its terminal goal, will seek power and dominance. While there are instrumental incentives to seek power, it is not always instrumentally rational to act on them: there are incentives to become a billionaire, but it is not necessarily rational for everyone to try to become one. Moreover, in multi-agent settings, AIs that seek dominance over others would likely be counteracted by other AIs, making it often irrational to pursue dominance. Pursuing and maintaining power is costly, and simpler actions are often more rational. Last, agents can be trained to be power averse, as explored in the MACHIAVELLI paper.
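As a rough illustration of what "trained to be power averse" can mean, here is a minimal sketch of reward shaping against power-seeking, assuming a gymnasium-style environment that exposes a per-step `power_score` annotation in its info dict. The names `power_score` and `power_penalty` are illustrative assumptions, not the MACHIAVELLI benchmark's actual API.

```python
# Minimal sketch of power-averse reward shaping: the environment reward is
# reduced in proportion to an annotated per-step "power" score, so training
# pressure points away from power-seeking behavior. `power_score` and
# `power_penalty` are hypothetical names used for illustration.

def shaped_reward(env_reward: float, info: dict, power_penalty: float = 1.0) -> float:
    """Subtract a penalty proportional to the step's annotated power score."""
    return env_reward - power_penalty * info.get("power_score", 0.0)


def run_episode(env, policy, power_penalty: float = 1.0) -> float:
    """Roll out one episode (gymnasium-style API) and return the shaped return."""
    obs, info = env.reset()
    total, done = 0.0, False
    while not done:
        action = policy(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        total += shaped_reward(reward, info, power_penalty)
        done = terminated or truncated
    return total
```

An agent optimized against this shaped return trades off task reward against the power penalty; how well that generalizes is exactly what the reactions below question.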
“Moreover, in multi-agent settings, AIs that seek dominance over others would likely be counteracted by other AIs” --> I’ve not read the whole thing, but what about collusion?
“Last, agents can be trained to be power averse, as explored in the MACHIAVELLI paper.” --> This is pretty weak imho. Yes, you can fine-tune GPTs to be honest, etc., or filter the training data down to only honest examples, but at the end of the day Cicero still learned strategic deception, and there is no principled way to rule that out completely.
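For concreteness, the "filter the dataset with only honest stuff" idea amounts to something like the sketch below, where `is_honest` is a hypothetical honesty judge (a classifier or human label lookup). Nothing in this pipeline prevents deceptive strategies from emerging from other training signals, which is the weakness noted above.

```python
# Rough sketch of filtering a fine-tuning dataset to honest examples only.
# `is_honest` is a hypothetical judge; this only shapes the training data and
# does not guarantee the resulting model avoids strategic deception.

from typing import Callable, Iterable

Example = dict  # e.g. {"prompt": ..., "response": ...}


def filter_honest(examples: Iterable[Example],
                  is_honest: Callable[[Example], bool]) -> list[Example]:
    """Keep only the examples the honesty judge accepts."""
    return [ex for ex in examples if is_honest(ex)]


# Usage (pseudocode for whatever SFT pipeline is in use):
# honest_subset = filter_honest(raw_examples, is_honest=my_honesty_classifier)
# model.fine_tune(honest_subset)
```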