“…modifying your goals/values is part of the easiest strategy for empowering you, since the easiest strategy for empowering you involves changing you massively to make you smarter…”
The likely easiest strategy for the AI to do anything is to make itself smarter. You are making two unjustified assumptions:
1. That upgrading the intelligence of existing humans is easier than upgrading the intelligence of the AI. This seems extraordinarily unlikely—especially given the starting assumption of AGI. Even if the ‘humans’ were already uploads, it would at most be a wash.
2. That upgrading the intelligence of existing humans would probably break a bunch of other things in the process, including goals/values.
Even if we grant 2 (which is hardly obvious given how the brain works: once we have mastered brain-like AGI, it should be fairly easy to add new neural power to uploads in the form of new cortical/cerebellar modules), that only undermines 1: if upgrading the intelligence of existing humans risks breaking things, that is all the more reason to instead upgrade the intelligence of the AI.
If anything, there’s more risk of the opposite of your concern—that the AI would not upgrade our intelligence as much as we’d like. But it’s not obvious how much of a problem that is either: unless humans desire intelligence for non-instrumental reasons, the AI upgrading itself should be equivalent to upgrading us. This type of AI really is an extension of our will, and difficult to distinguish from an exocortex-like brain upgrade.
Also, to be clear—I am not super confident that a pure external empowerment AI is exactly the best option, but I’m fairly confident it is enormously better than a self-empowering AI, and the context here is refuting the assumption that failure to explicitly learn complex fragile human values results in AI that kills us all or worse.
Ahhhhhhhhh, interesting! I do certainly agree that upgrading the AI’s own intelligence is easier than upgrading human intelligence. Hadn’t thought of that. I still firmly hold opinion 2, though.
Let’s think step by step:
It does seem plausible that, if you are a powerful AGI, the easiest & most robust way to empower a given human (in the sense of making it possible for them to achieve arbitrary goals) is to build another powerful AGI like yourself but corrigible (or maybe indirectly aligned?) to the human, so that when you hypothetically/counterfactually vary the human’s goals, those goals still end up getting achieved.
If so, then yeah, seems like making an aligned/corrigible AI reduces to the problem of making the AI optimize for human empowerment.
This is plausibly somewhat easier than solving alignment/corrigibility, because it’s maybe a simpler concept to point to. The concept of “what this human truly wants” is maybe pretty niche and complex whereas the concept of “empower this human” is less so. So if you are hitting the system with SGD until it scores well in your training environment, you might be able to bonk it into a shape where it is trying to empower this human more easily than you can bonk it into a shape where it is trying to do what this human truly wants. You still have to worry about the classic problems (e.g. maybe “do whatever will get high reward in training, then later do whatever you want” is even simpler/more natural and will overwhelmingly be the result you get instead in both cases) but the situation seems somewhat improved at least.
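For concreteness, here is a minimal toy sketch of what “pointing at empowerment” could look like, under some strong simplifying assumptions. One common formalization of empowerment is the channel capacity from an agent’s next few actions to its resulting state; in a small deterministic gridworld this collapses to the log of the number of distinct states the agent can still reach. Everything below (the gridworld, the `empowerment_proxy` function, the way the reward is wired up) is illustrative guesswork rather than anything proposed in this thread: the AI is scored on how many states the human can reach, with no model of what the human truly wants anywhere in the objective.

```python
# Toy sketch (assumed, not from the thread): empowerment of a human in a tiny
# deterministic gridworld, used as the AI's reward signal. For deterministic
# dynamics, empowerment over an H-step horizon reduces to
# log2(number of distinct states the human can reach within H steps).
from math import log2

GRID = 5                                               # 5x5 gridworld
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)]   # up/down/right/left/stay

def step(state, action, walls):
    """Deterministic transition: move unless blocked by a wall or the grid edge."""
    x, y = state
    dx, dy = action
    nxt = (x + dx, y + dy)
    if nxt in walls or not (0 <= nxt[0] < GRID and 0 <= nxt[1] < GRID):
        return state
    return nxt

def reachable_states(state, horizon, walls):
    """All states the human could occupy after `horizon` actions (stay included)."""
    frontier = {state}
    for _ in range(horizon):
        frontier = {step(s, a, walls) for s in frontier for a in ACTIONS}
    return frontier

def empowerment_proxy(state, horizon, walls):
    """log2 of reachable-state count; equals empowerment only in the deterministic case."""
    return log2(len(reachable_states(state, horizon, walls)))

def ai_reward(human_state, walls_after_ai_action, horizon=3):
    """The AI is rewarded for the human's empowerment in the world its action produced.
    Note: no model of the human's values appears anywhere in this objective."""
    return empowerment_proxy(human_state, horizon, walls_after_ai_action)

if __name__ == "__main__":
    human = (2, 2)
    open_world = set()                           # AI's action left the area clear
    boxed_in = {(1, 2), (3, 2), (2, 1), (2, 3)}  # AI's action walled the human in
    print(ai_reward(human, open_world))   # ~4.39: 21 reachable states
    print(ai_reward(human, boxed_in))     # 0.0: only the current square is reachable
```

The only point of the sketch is that the empowerment signal is computed from the environment dynamics alone, which is one way of cashing out “a simpler concept to point to” than a learned model of the human’s values; a real training setup would of course need a learned estimator of this quantity rather than exact enumeration.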
Thanks, I hope people think more about this. I guess it feels like it might help, though we already had corrigibility and honesty as simpler concepts that we could bootstrap with, and this is just a third (and probably not as good) one.
I’ve updated towards altruistic empowerment being pretty key; here is the extended argument.