This safety plan seems like it works right up until you want to use an AI to do something you wouldn’t be able to do.
If you want a superhuman AI to do good things and not bad things, you’ll need a more direct operationalization of good and bad.
If you’re in a situation where you can reasonably extrapolate from past rewards to future reward, you can probably extrapolate previously seen “normal behaviour” to normal behaviour in your situation. Reinforcement learning is limited—you can’t always extrapolate past reward—but it’s not obvious that imitative regularisation is fundamentally more limited.
(normal does not imply safe, of course)
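To make “imitative regularisation” concrete, here is a minimal sketch of the kind of objective I have in mind (my own framing; the KL penalty, the β weight, and the PyTorch details are assumptions, not something the original post specifies):

```python
import torch.nn.functional as F

def imitative_regularised_loss(logits, imitator_logits, actions, advantages, beta=0.1):
    """Sketch of a policy-gradient loss plus a KL penalty toward an imitation
    policy, so reward-seeking is pulled back toward previously seen
    'normal behaviour'. All names and the beta weight are illustrative."""
    log_probs = F.log_softmax(logits, dim=-1)
    imit_log_probs = F.log_softmax(imitator_logits, dim=-1)
    # Reward-seeking term: raise the log-prob of actions with positive advantage.
    chosen = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    pg_loss = -(advantages * chosen).mean()
    # Imitative regulariser: KL(pi || pi_imit), penalising drift from the imitator.
    kl = (log_probs.exp() * (log_probs - imit_log_probs)).sum(-1).mean()
    return pg_loss + beta * kl
```

Larger β keeps the agent closer to imitated behaviour at the cost of reward; β = 0 recovers ordinary reinforcement learning.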
I dunno, I think you can generalize reward farther than behavior. E.g. I might very reasonably issue high reward for winning a game of chess, or arriving at my destination safe and sound, or curing malaria, even if each involved intermediate steps that don’t make sense as ‘things I might do.’
I do agree there are limits to how much extrapolation we actually want; I just think there’s a lot of headroom for AIs to achieve ‘normal’ ends via ‘abnormal’ means.
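The structural difference, as I see it, is that reward can be attached to outcomes while imitation is scored step by step. A toy contrast (hypothetical functions, just to make the shape of the argument explicit):

```python
def outcome_reward(final_state):
    """Reward judged purely on the end state: I can recognise a won game,
    a safe arrival, or a cured disease without vetting any intermediate move."""
    return 1.0 if final_state == "goal_achieved" else 0.0

def imitation_score(agent_action, action_i_would_take):
    """Imitation judges every intermediate step against what I would have done,
    so it can't endorse a good trajectory built from unfamiliar moves."""
    return 1.0 if agent_action == action_i_would_take else 0.0
```

The first generalises to any trajectory that reaches the goal; the second only to trajectories whose individual steps already look like mine.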
I would be interested in what the questions of the uncertain imitator would look like in these cases.
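Concretely, the mechanism I picture is an imitator that defers to a question whenever its predictive distribution over “what would the human do here” is too flat. The entropy gate and threshold below are my assumptions, not anything specified in the thread:

```python
import numpy as np

def act_or_ask(action_probs, entropy_threshold=1.0):
    """Hypothetical uncertainty gate for an imitator: act when confident about
    what the demonstrator would do, otherwise surface the ambiguity as a question."""
    probs = np.asarray(action_probs, dtype=float)
    probs = probs / probs.sum()
    entropy = -(probs * np.log(probs + 1e-12)).sum()
    if entropy < entropy_threshold:
        return ("act", int(probs.argmax()))
    # In the chess/malaria cases the question might be: "this line reaches the
    # goal but isn't something you'd normally do; do you endorse it?"
    return ("ask", "I can't predict what you'd do here; which option do you endorse?")
```

In the ‘abnormal means’ cases above, the interesting part is exactly what that question contains and how often it gets asked.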