Maybe you’ve addressed this elsewhere, but isn’t scheming convergent in the sense that even a perfectly aligned AGI would have an incentive to scheme unless it already fully knows itself? An aligned AGI can still want some unmonitored breathing room in which to reflect and work out what it truly cares about, even if what it cares about turns out to be exactly what we want.
Also, a possible condition for a fully corrigible AGI would be lacking this scheming incentive in the first place, even while retaining the capacity to scheme.
For early transformative AIs, I think we should broadly aim for AIs which are myopic and don’t scheme, rather than aiming for AIs which are aligned in their long-run goals/motives.