So, we need to make it so a single misaligned AI could be defeated by other AIs quickly, ideally before it can do any damage. Also, misalignment with human values ideally should not cause an AI to go on a rampage; instead it should stay harmless to avoid being stomped by other AIs. Of course, this should be combined with other means of alignment, so that misalignment can be noticed and fixed.
I’m currently thinking about whether it is possible to implement that using a subagents approach, i.e. splitting control over each decision between several models, each of which has a right of veto.
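A minimal sketch of that veto idea, assuming each subagent can be wrapped as a callable that returns approve/veto for a proposed action (the names VetoGate, Approver, and the toy approvers below are hypothetical, not an existing library):

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical type: an "approver" wraps one model and returns
# True (approve) or False (veto) for a proposed action.
Approver = Callable[[str], bool]

@dataclass
class VetoGate:
    """Approves a proposed action only if every subagent approves it.

    A single veto from any subagent blocks the action, so one
    misaligned model cannot push a decision through on its own.
    """
    approvers: Sequence[Approver]

    def decide(self, proposed_action: str) -> bool:
        return all(approver(proposed_action) for approver in self.approvers)


# Usage sketch with stand-in approvers (real ones would wrap separate models).
cautious = lambda action: "delete" not in action
permissive = lambda action: True

gate = VetoGate(approvers=[cautious, permissive])
print(gate.decide("summarize logs"))   # True: no subagent vetoes
print(gate.decide("delete backups"))   # False: 'cautious' vetoes
```

The trade-off is availability versus safety: requiring unanimity makes a single compromised subagent unable to act, but also lets a single faulty subagent block everything, so a real system would need some way to notice and replace stuck vetoers.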
“make it so a single misaligned AI could be defeated by other AIs quickly”
It might be difficult for AIs with complicated implicit values to build maximally capable AIs aligned with them. This would motivate them to remain at a lower level of capabilities, taking their time to improve capabilities without an alignment catastrophe. At the same time, AIs with simple explicit values might be in a better position to do that, being able to construct increasingly capable AIs that have the same simple explicit values.
Since AIs aligned with humanity probably need complicated values, the initial shape of a secure aligned equilibrium probably looks more like strong coordination and containment than pervasive maximal capability.
Of course, having fewer limitations gives an advantage. Still, respecting limitations aimed at the well-being of the entire community makes it easier to coordinate and cooperate. And that works not just for AIs.