Is there some particular reason to assume that it’d be hard to implement?
To clarify, I meant the AI is unlikely to have it by default (being able to perfectly simulate a person does not in itself require having empathy as part of the reward function).
If we try to hardcode it, Goodhart’s curse seems relevant: https://arbital.com/p/goodharts_curse/
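Here's a toy sketch of that selection effect (my own illustration, not taken from the linked page), assuming the hardcoded objective is just the true value plus independent noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: each candidate action has a true value V, but the hardcoded
# objective only sees a noisy proxy U = V + error. Picking the candidate
# with the highest proxy score systematically favors candidates whose
# error term happens to be large and positive, so the proxy overstates
# the true value, and the absolute gap widens as the search gets wider.
def goodhart_gap(n_candidates: int, n_trials: int = 2000) -> tuple[float, float]:
    true_values = rng.normal(0.0, 1.0, size=(n_trials, n_candidates))
    proxy_error = rng.normal(0.0, 1.0, size=(n_trials, n_candidates))
    proxy = true_values + proxy_error
    best = proxy.argmax(axis=1)          # what the proxy-optimizer picks
    rows = np.arange(n_trials)
    return proxy[rows, best].mean(), true_values[rows, best].mean()

for n in (10, 100, 10_000):
    proxy_score, true_score = goodhart_gap(n)
    print(f"candidates={n:>6}  proxy score={proxy_score:.2f}  true value={true_score:.2f}")
```

In this toy setup the winning candidate's proxy score is roughly double its true value, and the absolute gap between them keeps growing as the optimizer searches over more candidates: searching harder makes the proxy look better while the real outcome lags further behind.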
But note that "Reward is not the optimization target".
I’m thinking that even if it didn’t break when going out of distribution, it would still not be a good idea to try to train an AI to do things that make us feel good: what if it decided it wanted to hook us up to morphine pumps?