Making the (tenuous) assumption that humans remain in control of AGI, won’t it just be an absolute shitshow of attempted power grabs over who gets to tell the AGI what to do? For example, supposing OpenAI is the first to AGI, is it really plausible that Sam Altman will be the one actually in charge when there will have been multiple researchers interacting with the model much earlier and much more frequently? I have a hard time believing every researcher will sit by and watch Sam Altman become more powerful than anyone ever dreamed of when there’s a chance they’re a prompt away from having that power for themselves.
You’re assuming that: - There is a single AGI instance running. - There will be a single person telling that AGI what to do - The AGI’s obedience to this person will be total.
I can see these assumptions holding approximately true if we get really really good at corrigibility and if at the same time running inference on some discontinuously-more-capable future model is absurdly expensive. I don’t find that scenario very likely, though.
I see no reason why any of these will be true at first. But the end-goal for many rational agents in this situation would be to make sure 2 and 3 are true.
Making the (tenuous) assumption that humans remain in control of AGI, won’t it just be an absolute shitshow of attempted power grabs over who gets to tell the AGI what to do? For example, supposing OpenAI is the first to AGI, is it really plausible that Sam Altman will be the one actually in charge when there will have been multiple researchers interacting with the model much earlier and much more frequently? I have a hard time believing every researcher will sit by and watch Sam Altman become more powerful than anyone ever dreamed of when there’s a chance they’re a prompt away from having that power for themselves.
You’re assuming that:
- There is a single AGI instance running.
- There will be a single person telling that AGI what to do
- The AGI’s obedience to this person will be total.
I can see these assumptions holding approximately true if we get really really good at corrigibility and if at the same time running inference on some discontinuously-more-capable future model is absurdly expensive. I don’t find that scenario very likely, though.
I see no reason why any of these will be true at first. But the end-goal for many rational agents in this situation would be to make sure 2 and 3 are true.
Correct, those goals are instrumentally convergent.