Reward has the mechanistic effect of chiseling cognition into the agent’s network.
Absolutely. Though in the next sentence:
Therefore, properly understood, reward does not express relative goodness and is therefore not an optimization target at all.
I’d mention two things here:
1) The more complex and advanced a model is, the more likely it is [I think] to learn a mesa-optimization goal that is extremely similar to the actual reward it was trained on (because that is basically the most generalizable mesa-goal to learn, with respect to the training data).
2) Reinforcement learning methods in particular design this in, by training the model to learn value functions whose sole purpose is to estimate the expected reward over multiple time steps associated with an action or state (see the sketch after this list). So it's arguably more natural in an RL scenario, particularly one where the score is visible (e.g. in the corner of the screen in a video game), to learn this as a "mesa-optimization" goal early on.
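For concreteness, here is a minimal sketch (mine, not anything from the original post) of a tabular TD(0) value function in Python. The `env_step` interface and the toy chain environment are hypothetical stand-ins; the point is just that the only quantity V is trained to approximate is the expected discounted sum of future rewards, i.e. exactly the "expected reward over multiple time steps" described above.

```python
import random
from collections import defaultdict

def td0_value_estimate(env_step, start_state, episodes=1000, alpha=0.1, gamma=0.99):
    """env_step(state, action) -> (next_state, reward, done); hypothetical interface."""
    # V[s] is trained to approximate E[ r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... | s_t = s ]
    V = defaultdict(float)
    for _ in range(episodes):
        s, done = start_state, False
        while not done:
            a = random.choice([0, 1])            # placeholder policy: act randomly
            s_next, r, done = env_step(s, a)
            # TD(0) update: nudge V[s] toward the bootstrapped multi-step return estimate.
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])
            s = s_next
    return V

# Toy usage: a 3-state chain where reaching state 2 is terminal and yields reward 1.
def chain_step(s, a):
    s_next = min(s + 1, 2)
    return s_next, (1.0 if s_next == 2 else 0.0), s_next == 2

values = td0_value_estimate(chain_step, start_state=0)
print({s: round(v, 2) for s, v in values.items()})  # roughly {0: 0.99, 1: 1.0}
```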