I think the terminological confusion is on your side: what you're describing is closer to what some RL algorithms call a value function.
Does a chess-playing RL agent make whichever move maximises reward? Not unless it has converged to the optimal policy, which in practice it hasn't. The reward signal of +1 for a win, 0 for a draw and −1 for a loss is, in a sense, hard-coded into the agent, but not in the sense of being the metric the agent uses to select actions. Instead the agent selects moves using its value function, which is an estimate of the reward it will receive in the future, and that estimate is not the same thing as the reward.
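To make that concrete, here's a minimal sketch in Python (my own illustration, not any particular chess engine's API; `legal_moves`, `apply` and `value_estimate` are placeholder names) of value-based move selection: during play the agent only ever queries its learned value estimate, and the +1/0/−1 reward shows up only in the training update.

```python
# Minimal sketch of value-based move selection. The helper names
# (legal_moves, apply, value_estimate) are placeholders invented for
# this example, not any real library's API.

def select_move(position, legal_moves, apply, value_estimate):
    """Pick the move whose successor position the learned value function rates highest."""
    best_move, best_value = None, float("-inf")
    for move in legal_moves(position):
        successor = apply(position, move)
        v = value_estimate(successor)  # estimate of future reward, not the reward itself
        if v > best_value:
            best_move, best_value = move, v
    return best_move

# The +1 / 0 / -1 reward only enters during training, e.g. in a TD-style update:
#   V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
# where r is 0 on every move until the terminal +1, 0 or -1.
```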
The iodinosaurs example perhaps obscures the point, since the iodinos seem inner-aligned: they probably do terminally value (the feeling of) getting iodine, and they are unlikely to optimise a proxy instead. In that case the value function used to select actions is very similar to the reward function, but in general it needn't be. Consider an agent that has previously been rewarded for getting raspberries and now faces a choice between a raspberry and a blueberry: even if it knows the blueberry would yield higher reward, it might not care. It values raspberries, and it selects its actions based on what it values.
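If it helps, here's a toy illustration of the raspberry/blueberry situation, with numbers made up purely for the example: the reward function favours the blueberry, but action selection goes through the agent's learned values, which favour the raspberry.

```python
# Toy illustration of a learned value function that has come apart from
# the reward function; all numbers here are invented for the example.

reward_function = {"raspberry": 1.0, "blueberry": 2.0}  # the blueberry actually pays more reward
learned_values  = {"raspberry": 0.9, "blueberry": 0.1}  # but the agent learned to value raspberries

def choose(options, values):
    # Action selection consults the agent's values, not the reward function.
    return max(options, key=lambda option: values[option])

print(choose(["raspberry", "blueberry"], learned_values))  # -> "raspberry", despite the lower reward
```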