I’m encouraged by your optimism, and wish you the best of luck (British, and otherwise), but I hope you’re not getting much of your intuition from the “Humans have demonstrated a skill with value extrapolation...” part. I don’t think we have good evidence for this in a broad enough range of circumstances for it to apply well to the AGI case.
We know humans do pretty ‘well’ at this in a particular setting: surrounded by dozens of other similar agents, in a game-theoretic context where it pays to cooperate, where it pays to share values with others, and where extreme failure modes usually lead to loss of any significant power before they can lead to terrible abuse of that power.
Absent such game-theoretic constraints, I don’t think we know much at all about how well humans do at this.
Further, I don’t think I know what it means to do value extrapolation well—beyond something like “you’re doing it well if you’re winning” (what would it look like for almost all humans to do it badly?). That’s fine for situations where cooperation with humans is the best way to win. Not so much where it isn’t.
But with luck I’m missing something!
I don’t put too much weight on that intuition, except as an avenue to investigate (how exactly do humans do it? If it depends on the social environment, can those conditions be replicated?).