I don’t think I understand this post. Are you making the claim that “because some subset of human values are (and were selected for being) instrumentally convergent, we don’t have to worry about outer alignment if we project our values down to that subset”?
If so, that seems wrong to me, because in most alignment-failure scenarios the AI does have a terminal goal that would seem “arbitrary” or “wrong” to us. It only pursues the instrumentally convergent goals because they help it toward that terminal goal. That means you can’t bank on the AI not turning you into paperclips at some point, because it might judge that to be more expedient than keeping you around as, say, another computer for doing research.
There’s also the danger that, if the AI leaves you around, your inability to precommit to certain strategies will always pose a threat to its totalizing vision of a universe full of paperclips. If so, it’s instrumentally convergent for the AI to eliminate or permanently disempower you, even if you yourself are currently aiming at the same goals the AI is, both instrumental and terminal.
Hmmm, it’s good to know my thesis wasn’t very clear.
The idea is to train an AI to have our values as part of its end goals. That doesn’t solve inner alignment, I agree. But say the AI ends up wanting to maximize paperclips: it would then be constrained not to damage our survival etc. while making paperclips.
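A minimal toy sketch of the “constrained maximizer” framing I have in mind (the action names, the scalar survival proxy, and the threshold are all made up for illustration, not a proposal for an actual training setup):

```python
# Toy illustration: an agent that maximizes paperclips, but only over
# actions that keep a crude "human survival" proxy above a floor.

# Hypothetical actions: (paperclips produced, resulting survival proxy in [0, 1])
ACTIONS = {
    "build_factory":        (100, 0.95),
    "strip_mine_biosphere": (10_000, 0.10),
    "do_nothing":           (0, 1.00),
}

SURVIVAL_FLOOR = 0.9  # assumed constraint: never push the proxy below this


def choose_action(actions, floor):
    """Pick the paperclip-maximizing action among those satisfying the constraint."""
    feasible = {name: vals for name, vals in actions.items() if vals[1] >= floor}
    return max(feasible, key=lambda name: feasible[name][0])


print(choose_action(ACTIONS, SURVIVAL_FLOOR))  # -> "build_factory", not "strip_mine_biosphere"
```

The point of the sketch is only that the paperclip goal gets optimized inside the feasible set carved out by the trained-in human values, rather than over all actions.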
I was trying to figure out what set of values we are even trying to give an AGI in the first place, and this was my best guess: whatever else you do, optimize for humanity’s instrumentally convergent goals.