If our model uses some variant of gradient ascent, it will end up at high values of the reward function.
Agreed.
In that sense the model does optimize for reward.
Disagreed. Consider vanilla PG, which is the closest thing I know of to "doing gradient ascent in the reward landscape." Here, the RL training process optimizes the model in the direction of historically observed rewards. In such policy gradient methods, the model receives local cognitive updates (in the form of gradients) which increase the logits on actions judged to have produced reward (in vanilla PG, this judgment is "was the action part of a high-reward trajectory?"). The model is being optimized in the direction of previously observed rewards, given the collected data distribution (e.g. it put some trash away and observed some reward), the visited states, and its current parameterization.
This process might even find very high reward policies. I expect it will. But that doesn’t mean the model is optimizing for reward.
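To make the "optimized in the direction of previously observed rewards" point concrete, here is a minimal REINFORCE-style sketch on a toy three-armed bandit. The environment, reward values, and learning rate are illustrative assumptions, not anything from the discussion above. The thing to notice is where reward enters: it only appears as a scalar weighting the log-probability gradient of actions that were actually taken, so the update increases the logits of actions from rewarded trajectories. Nothing in the policy itself represents or evaluates reward.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical, for illustration): one state, three actions,
# fixed per-action rewards. The policy is a softmax over logits.
N_ACTIONS = 3
reward_per_action = np.array([0.0, 1.0, 0.2])  # action 1 yields the most reward
logits = np.zeros(N_ACTIONS)                   # policy parameters
LR = 0.5

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

for step in range(200):
    probs = softmax(logits)
    action = rng.choice(N_ACTIONS, p=probs)    # sample a (length-1) trajectory
    ret = reward_per_action[action]            # observed return for that trajectory

    # Vanilla PG / REINFORCE update: return * grad log pi(action).
    # For a softmax policy, grad log pi(a) = one_hot(a) - probs, so the update
    # pushes up the logit of the action that was part of the rewarded trajectory
    # (and pushes the others down), weighted by the observed return.
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    logits += LR * ret * grad_log_pi

print("learned action probabilities:", softmax(logits).round(3))
```

Running this, the policy concentrates probability on the highest-reward action. But the learned parameters are just logits that got nudged toward historically rewarded actions; the resulting policy can reach high reward without containing anything that optimizes for reward.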