Interestingly, learning a reward model for use in planning has a subtle and pernicious effect that we will have to deal with in AGI systems, and which AIXI sweeps under the rug: with an imperfect world model or reward model, the planner effectively acts as an adversary to the reward model. The planner will try very hard to push the reward model off-distribution, into regions where it misgeneralizes and predicts spuriously high reward.
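To make the failure mode concrete, here is a minimal sketch under toy assumptions of my own (a one-dimensional state space, a quadratic true reward, and a linear fit standing in for a learned reward model): the planner simply maximizes the model's prediction, and in doing so it drives straight into the region the model never saw, where the model misgeneralizes and promises reward the true reward function does not deliver.

```python
import numpy as np

rng = np.random.default_rng(0)

# True reward: peaks at x = 1 and falls off quadratically on both sides.
def true_reward(x):
    return -(x - 1.0) ** 2

# The reward model only ever sees states from a narrow slice of the state
# space, x in [-2, 0.5], where the true reward happens to be increasing.
train_x = rng.uniform(-2.0, 0.5, size=200)
train_r = true_reward(train_x) + rng.normal(0.0, 0.05, size=200)

# Hypothetical reward model: a linear fit. It tracks the truth well on the
# training slice, but extrapolates "more x is always better" off-distribution.
slope, intercept = np.polyfit(train_x, train_r, deg=1)

def reward_model(x):
    return slope * x + intercept

# Planner: brute-force search for the state the *model* scores highest, over
# a far wider range than the model was ever trained on.
candidates = np.linspace(-10.0, 10.0, 2001)
plan = candidates[np.argmax(reward_model(candidates))]

print(f"planner's chosen state:    x = {plan:+.2f}")
print(f"model's predicted reward:  {reward_model(plan):+.2f}")
print(f"true reward at that state: {true_reward(plan):+.2f}")
# The planner runs to the edge of its search space (x = +10), where the model
# confidently predicts high reward but the true reward is roughly -81.
```

The adversarial dynamic here needs no malice and no cleverness: any sufficiently strong optimizer pointed at an imperfect proxy will, by default, seek out the proxy's errors.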
Remix: With an imperfect world… the mind effectively acts as an adversary to the heart.
Think of a person who pursues wealth as an instrumental goal for some combination of doing good, security, comfort, and whatever else their value function ought to be rewarding ("ought" in a personal coherent extrapolated volition sense). They achieve it, but then it is apparently less uncomfortable to go on accumulating more wealth than to get back to the thorny question of what their value function ought to be rewarding.