As far as I can tell, the answer is: don’t reward your AIs for taking bad actions.
I think there’s a mistake here which kind of invalidates the whole post. If we don’t reward our AI for taking bad actions within the training distribution, it’s still very possible that in a future world that looks quite unlike the training distribution, the AI will be able to find such an action. It’s the same way ice cream wasn’t in evolution’s training distribution for us, but we found it anyway.
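(A toy sketch of the worry, not anything from the post or this thread: assume a scalar "state", two actions, and per-action value estimates fit only on training states in [0, 1], with a "bad" action that is never rewarded there. Polynomial fits stand in for a learned value network.)

```python
# Toy sketch (illustration only, not the post's setup): never rewarding the
# "bad" action in-distribution does not pin down what the learned values do
# far outside the training range.
import numpy as np

rng = np.random.default_rng(0)

# Training states all lie in [0, 1].
train_states = rng.uniform(0.0, 1.0, size=200)
reward_good = 1.0 - 0.5 * train_states       # "good" action: always some reward
reward_bad = np.zeros_like(train_states)     # "bad" action: never rewarded

# Degree-3 polynomial fits stand in for a learned value network.
coef_good = np.polyfit(train_states, reward_good, 3)
coef_bad = np.polyfit(train_states, reward_bad, 3)

def act(state):
    """Pick whichever action the fitted value estimates score higher."""
    v_good = np.polyval(coef_good, state)
    v_bad = np.polyval(coef_bad, state)
    return "good" if v_good >= v_bad else "bad"

print(act(0.5))   # in-distribution: "good", exactly as trained
print(act(10.0))  # far out of distribution: the extrapolated values flip to "bad"
```

Withholding reward for the bad action on [0, 1] constrains behavior there, but says nothing by itself about which action the extrapolated values favor at 10.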
I think there’s a mistake here which kind of invalidates the whole post.
Ice cream is exactly the kind of thing we’ve been trained to like. Liking ice cream is very much the correct response.
Everything outside the training distribution has some value assigned to it. The mere fact that we like ice cream isn’t evidence that something has gone wrong.