Ahhhhhhhhh, interesting! I do certainly agree that upgrading an AI's own intelligence is easier than upgrading human intelligence. Hadn't thought of that. I still firmly hold opinion 2 though.
Let’s think step by step:
It does seem plausible that, if you are a powerful AGI, the easiest & most robust way to empower a given human, in the sense of making it possible for them to achieve arbitrary goals, is to build another powerful AGI like yourself but corrigible (or maybe indirectly aligned?) to that human, so that when you hypothetically/counterfactually vary the human's goals, those goals still end up getting achieved. (One standard way to make "empower" precise is sketched after these steps.)
If so, then yeah, seems like making an aligned/corrigible AI reduces to the problem of making the AI optimize for human empowerment.
This is plausibly somewhat easier than solving alignment/corrigibility, because it's maybe a simpler concept to point to. The concept of "what this human truly wants" is maybe pretty niche and complex, whereas the concept of "empower this human" is less so. So if you are hitting the system with SGD until it scores well in your training environment, you might be able to bonk it into a shape where it is trying to empower this human more easily than you can bonk it into a shape where it is trying to do what this human truly wants. You still have to worry about the classic problems (e.g. maybe "do whatever will get high reward in training, then later do whatever you want" is even simpler/more natural and will overwhelmingly be the result you get instead in both cases), but the situation seems somewhat improved at least.
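Since "empower this human" is doing a lot of work here, it may help to note that the RL literature already has an information-theoretic formalization of empowerment: roughly, the channel capacity between an agent's action sequences and the states they lead to, i.e. how many distinct futures the agent can reliably bring about. Below is a minimal toy sketch of that idea under stated assumptions (a deterministic 5x5 gridworld, a two-step horizon, and function names I made up for illustration); it is not meant as the definition intended in this discussion, just one concrete instance of the concept.

```python
# Toy sketch of information-theoretic "empowerment". For deterministic
# dynamics, the channel capacity between n-step action sequences and
# resulting states reduces to log2(number of distinct states reachable
# in n steps) -- a crude proxy for "how many different goals could be
# achieved from here". The gridworld, horizon, and all names below are
# illustrative assumptions, not taken from the discussion above.
import itertools
import math

GRID = 5  # 5x5 gridworld
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def step(state, action):
    """Deterministic transition: move if the target cell stays on the grid."""
    x, y = state
    dx, dy = ACTIONS[action]
    nx, ny = x + dx, y + dy
    if 0 <= nx < GRID and 0 <= ny < GRID:
        return (nx, ny)
    return state  # bumping into a wall leaves you where you were

def empowerment(state, horizon):
    """log2 of the number of distinct states reachable by some horizon-step action sequence."""
    reachable = set()
    for seq in itertools.product(ACTIONS, repeat=horizon):
        s = state
        for a in seq:
            s = step(s, a)
        reachable.add(s)
    return math.log2(len(reachable))

# A corner offers fewer reachable futures than the center, so an agent
# rewarded for the human's empowerment would (in this toy model) keep
# the human away from corners, whatever the human's actual goal turns out to be.
print(empowerment((0, 0), horizon=2))  # ~2.58 bits (6 reachable states)
print(empowerment((2, 2), horizon=2))  # ~3.17 bits (9 reachable states)
```

The "vary the human's goals counterfactually and they still get achieved" framing above corresponds to the same flavor of definition: you are scoring states by how many different goals could be reached from them, rather than by progress toward any one goal.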
Thanks, I hope people think more about this. It feels like it might help, though we already had corrigibility and honesty as simpler concepts that we could bootstrap with, and this is just a third (and probably not as good) one.
I’ve updated towards altruistic empowerment being pretty key; here is the extended argument.