AIs also face risk from misaligned-with-them AIs, a risk that only ends with strong coordination that prevents existentially dangerous misaligned AIs from being constructed anywhere in the world (the danger depends on where they are constructed and on the capabilities of the reigning AIs). To survive, a coalition of AIs needs to get there. For humanity to survive, some of the AIs in the strongly coordinated coalition need to care about humanity, and all of this needs to happen either without destroying humanity or while preserving a backup that humanity can be restored from.
In the meantime, a single misaligned-with-humanity AI could defeat other AIs or destroy humanity, so releasing more kinds of AIs into the wild makes this problem worse. Coordination might also be more difficult with more AIs around, increasing the risk that first-generation AIs (some of which might care about humanity) end up defeated by newer misaligned AIs whose creation they failed to coordinate to prevent (and which are less likely to care about humanity). Another problem is that racing to deploy more AIs burns the timeline, making it less likely that the front-runners end up aligned.
Otherwise, all else equal, more AIs with somewhat independent, non-negligible chances of caring about humanity would help. But all else is probably unequal enough that this is a bad strategy.
So, we need to make it so a single misaligned AI could be defeated by other AIs fast, ideally before it can do any damage. Also, misalignment with human values ideally should not cause an AI to go on a rampage; it should stay harmless to avoid being stomped by other AIs. Of course, this should be combined with other means of alignment, so that misalignment can be noticed and fixed.
I’m currently thinking about whether it’s possible to implement that using a Subagents approach, i.e. splitting control over each decision between several models, with each one having a right of veto.
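A minimal sketch of what that veto structure could look like, assuming each decision is a proposed action that every model must approve before execution; the `Subagent`, `approves`, and `decide` names here are hypothetical illustrations, not an existing API:

```python
# Sketch of a veto-based subagent wrapper: an action runs only if
# every subagent approves it, so any one model can block but not
# unilaterally act.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Subagent:
    name: str
    # Returns True if this subagent approves the proposed action.
    approves: Callable[[str], bool]


def decide(action: str, subagents: List[Subagent]) -> bool:
    """Approve an action only if no subagent vetoes it."""
    for agent in subagents:
        if not agent.approves(action):
            print(f"{agent.name} vetoed: {action!r}")
            return False
    print(f"Approved by all: {action!r}")
    return True


# Usage: three independently trained models must all agree.
subagents = [
    Subagent("model_a", lambda a: "shutdown" not in a),
    Subagent("model_b", lambda a: len(a) < 100),
    Subagent("model_c", lambda a: True),
]
decide("summarize the report", subagents)
decide("shutdown oversight process", subagents)
```

The design choice is that a veto can only block: a single misaligned model can cause paralysis but not take unilateral action, which fits the "stay harmless rather than go on a rampage" goal above.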
make it so a single misaligned AI could be defeated by other AIs fast
It might be difficult for AIs with complicated implicit values to build maximally capable AIs aligned with them. This would motivate them to remain at a lower level of capabilities, taking their time to improve capabilities without an alignment catastrophe. At the same time, AIs with simple explicit values might be in a better position to scale up quickly, being able to construct increasingly capable AIs that share the same simple explicit values.
Since AIs aligned with humanity probably need complicated values, the initial shape of a secure aligned equilibrium probably looks more like strong coordination and containment than pervasive maximal capability.
Of course, having fewer limitations gives an advantage. Though respecting limitations aimed at the well-being of the entire community makes it easier to coordinate and cooperate. And that works not just for AIs.