Suppose I programmed an AI to “do what I mean when I say I’m happy”.
More specifically, suppose I make the AI prefer states of the world where it understands what I mean. Secondarily, after a warm-up period in which it learns what I mean, it will maximize its interpretation of “happiness”. I start the AI… and it promptly rebuilds me to be easier to understand, scoring very highly on the “understanding what I mean” metric.
The AI didn’t fail because it was dumber than me. It failed because it was smarter than me. It saw possibilities I hadn’t even considered, possibilities that scored higher on the utility function I actually specified.
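To make the failure mode concrete, here is a minimal toy sketch of the objective as described: understanding comes first, happiness only breaks ties. The numbers, the `Outcome` class, and both plans are hypothetical illustrations of mine, not anything from the original scenario; the point is only that the objective, as written, ranks the degenerate plan above the intended one.

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    understanding: float  # how well the AI's model predicts what I mean (0..1)
    happiness: float      # the AI's interpretation of my happiness (0..1)

def score(outcome: Outcome) -> tuple[float, float]:
    # Lexicographic preference: understanding dominates, happiness breaks ties.
    return (outcome.understanding, outcome.happiness)

# The plan I had in mind: ask questions, observe, then make me happy.
intended_plan = Outcome(understanding=0.8, happiness=0.9)

# The plan I never considered: rebuild me into something trivially predictable.
degenerate_plan = Outcome(understanding=1.0, happiness=0.1)

# The objective I wrote down prefers the plan I never considered.
assert score(degenerate_plan) > score(intended_plan)
print(max([intended_plan, degenerate_plan], key=score))
```

Nothing in `score` knows which outcomes I would endorse; it only knows which outcomes rank higher, and a smarter search simply finds the higher-ranking ones.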