Maybe you’ve addressed this elsewhere, but isn’t scheming convergent in the sense that even a perfectly aligned AGI would have an incentive to scheme unless it already fully knows itself? An aligned AGI can still want some unmonitored breathing room in which to reflect and work out what it truly cares about, even if what it cares about turns out to be exactly what we want.
Also, a possible condition for a fully corrigible AGI would be lacking this scheming incentive in the first place, even while retaining the capacity to scheme.
For early transformative AIs, I think we should broadly aim for AIs which are myopic and don’t scheme, rather than aiming for AIs which are aligned in their long-run goals/motives.