You’re assuming that:
- There is a single AGI instance running.
- There will be a single person telling that AGI what to do.
- The AGI’s obedience to this person will be total.
I can see these assumptions holding approximately true if we get really really good at corrigibility and if at the same time running inference on some discontinuously-more-capable future model is absurdly expensive. I don’t find that scenario very likely, though.
I see no reason why any of these will be true at first. But the end-goal for many rational agents in this situation would be to make sure the second and third are true.
Correct, those goals are instrumentally convergent.