The simplest way to explain “the reward function isn’t the utility function” is: humans evolved to have utility functions because having one was instrumentally useful for scoring well on the reward function; in other words, evolution selected agents with utility functions.
(yeah I know maybe we don’t even have utility functions; that’s not the point)
Concretely: it was useful for humans to have feelings and desires, because that way evolution didn’t have to spoonfeed us every last detail of how we should act; instead it could give us heuristics like “food smells good, I want”.
Evolution couldn’t just select a perfect optimizer of the reward function, because there is no such thing: computing truly optimal actions is computationally intractable (in the limit, a “perfect optimizer” is uncomputable). So instead it selected agents that were boundedly optimal given their training environment.
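Here’s a minimal toy sketch of the gap this creates. Everything in it is made up for illustration (the “sweetness as a proxy for calories” setup, all function names): an agent whose internal utility is a cheap heuristic scores nearly as well as the true reward in the training distribution, then diverges once the environment shifts.

```python
import random

random.seed(0)

# Outer "reward function": actual nutritional value, the thing selection acts on.
def reward(food):
    return food["calories"]

# Inner heuristic utility: "sweet = good". Cheap to compute, and a good proxy
# *in the training environment*, where sweetness correlates with calories.
def proxy_utility(food):
    return food["sweetness"]

def sample_ancestral_food():
    calories = random.uniform(0, 100)
    # In the ancestral distribution, sweetness tracks calories (plus noise).
    return {"calories": calories, "sweetness": calories + random.gauss(0, 5)}

def sample_modern_food():
    # Off-distribution: artificial sweeteners break the correlation.
    return {"calories": random.uniform(0, 100), "sweetness": random.uniform(0, 100)}

def pick_best(foods, utility):
    return max(foods, key=utility)

for name, sampler in [("ancestral", sample_ancestral_food),
                      ("modern", sample_modern_food)]:
    avg = sum(
        reward(pick_best([sampler() for _ in range(10)], proxy_utility))
        for _ in range(1000)
    ) / 1000
    print(f"{name}: avg reward of proxy-chosen food = {avg:.1f}")
```

In the ancestral environment the proxy-driven agent gets near-maximal reward, so selection is happy with it; in the modern one it doesn’t, because the agent keeps optimizing its utility function, not the reward function.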