As with the previous paper, this argument is only really a problem when the agent’s belief about the reward function is wrong: if it is correct, then at the point where there is no more information to gain, the agent should already know that humans don’t like to be killed, do like to be happy, etc.
There’s also the scenario where the AI models the world in a way that has predictive power as good as or better than our intentional stance model, but this weird model assigns undesirable values to the AI’s co-player in the CIRL game. We can’t rely on the agent “already knowing that humans don’t like to be killed,” because the AI doesn’t have to be using the level of abstraction on which “human” or “killed” are natural categories.
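To make that concrete, here’s a toy sketch (everything in it is my own illustrative construction, not anything from the paper): two models that are relabelings of the same states and predict identically, yet a reward written in the second model’s vocabulary completely ignores whether the human is alive.

```python
# A deliberately tiny illustration (all names here are hypothetical).
import itertools

# "Intentional stance" ontology: a state is (human_alive, light_on).
INTENTIONAL_STATES = list(itertools.product([0, 1], repeat=2))

def observe(state):
    """The only thing either model has to predict: the XOR of the two bits."""
    human_alive, light_on = state
    return human_alive ^ light_on

# Model A predicts observations using the intentional-stance variables.
model_a = {s: observe(s) for s in INTENTIONAL_STATES}

# Model B relabels the same states with opaque "microphysical" codes 0..3,
# so it predicts exactly the same observations: no data favors A over B.
code_of = {s: i for i, s in enumerate(INTENTIONAL_STATES)}
model_b = {code_of[s]: observe(s) for s in INTENTIONAL_STATES}
assert all(model_a[s] == model_b[code_of[s]] for s in INTENTIONAL_STATES)

# A reward expressed in B's vocabulary ("prefer even codes") is well-defined,
# but it cuts straight across the human_alive category:
reward_b = {code: 1.0 if code % 2 == 0 else 0.0 for code in model_b}
for s in INTENTIONAL_STATES:
    print(s, "-> code", code_of[s], "reward", reward_b[code_of[s]])
# (0, 0) and (1, 0) both get reward 1.0: one has a dead human, one a live one.
```

The point isn’t that B is a silly model; it’s that nothing in the observations distinguishes it from A, so a reward that is perfectly coherent in B’s terms can still be indifferent to the thing we care about.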
I certainly would count an ontological failure in the reward function as an incorrect belief about the reward function.
I’m just a little leery of calling a model “wrong” when it makes the same predictions about observations as a “right” one. I don’t want people to think that we can avoid “wrong ontologies” by starting with some reasonable-sounding universal prior and then updating on lots of observational data, or that something “wrong” will be doing something systematically stupid, probably due to some mistake or limitation that of course the reader would never program into their AI.
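Here’s that worry as a couple of lines of Bayes (again a toy of my own, with made-up numbers): if two hypotheses assign identical likelihoods to every observation, the posterior ratio between them stays exactly where the prior put it, no matter how much data you feed in.

```python
# Toy sketch: updating on observations never separates observationally
# equivalent hypotheses (all numbers here are illustrative).
import random

random.seed(0)

def likelihood(obs):
    # Both the "sane" and the "weird" ontology predict obs == 1 with p = 0.7.
    return 0.7 if obs == 1 else 0.3

post_sane, post_weird = 0.9, 0.1  # the "reasonable-sounding" prior

for _ in range(10_000):
    obs = 1 if random.random() < 0.7 else 0
    post_sane *= likelihood(obs)   # identical likelihood terms...
    post_weird *= likelihood(obs)  # ...so the ratio cannot move
    total = post_sane + post_weird
    post_sane, post_weird = post_sane / total, post_weird / total

print(post_sane, post_weird)  # still 0.9 and 0.1 after 10,000 observations
```

So “lots of observational data” only rules out ontologies that were already making different predictions, which is exactly the case we weren’t worried about.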