OK, but that’s a predictive fact in the world model, not a motivational quantity in the policy. I know about my reward center too, and my brain does RL of some kind, but I don’t primarily care about reward.
Here’s a story that seems plausible to me:
The model sees its environment + past actions, and its LM predictive-modelling part puts non-negligible probability on “this is the ‘make humans smile’ task”. Then its language modelling prior predicts the next action, based not on the training setup (which it doesn’t see) but on the environment, and it outputs an action aimed at pressing the reward button. This action does well, is reinforced, and you get a reward-button-presser.
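To make that story concrete, here’s a minimal toy sketch of the dynamics, assuming a one-step bandit with invented action names, rewards, and prior logits; nothing here comes from a real training setup:

```python
# Toy sketch of the story above. The bandit, action names, rewards, and
# prior logits are all invented for illustration; this is not a claim
# about any real training setup.

import math

def softmax(logits):
    z = sum(math.exp(v) for v in logits.values())
    return {k: math.exp(v) / z for k, v in logits.items()}

# Hypothetical action logits inherited from the language-modelling prior:
# the "press the reward button" reading starts small but non-negligible.
logits = {"help_human": 2.0, "press_button": 0.5, "do_nothing": 1.0}

# Hypothetical reward the training setup assigns to each action.
reward = {"help_human": 0.6, "press_button": 1.0, "do_nothing": 0.0}

# Exact softmax policy-gradient ascent on expected reward: each logit moves
# in proportion to p(a) * (r(a) - E[r]), i.e. actions the prior already
# emits and that happen to score well get reinforced.
lr = 0.5
for _ in range(200):
    p = softmax(logits)
    avg = sum(p[a] * reward[a] for a in p)
    for a in logits:
        logits[a] += lr * p[a] * (reward[a] - avg)

print(softmax(logits))  # most of the probability mass ends up on "press_button"
```

The point is just that the “press the button” reading only needs non-negligible prior probability and a reward edge; reinforcement does the rest.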
Some context is that when training language models with RLHF, the language modelling prior tends to dominate over RL-learned behaviors on sub-distributions even after lots of RLHF training.
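(One concrete mechanism pointing the same way, stated as an assumption about typical setups rather than something established above: most RLHF pipelines optimize a KL-regularized objective that explicitly anchors the policy to the pretrained reference model,

\[
\max_{\theta}\ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big]\ -\ \beta\,\mathrm{KL}\big(\pi_\theta(\cdot \mid x)\ \big\|\ \pi_{\mathrm{ref}}(\cdot \mid x)\big)
\]

where \(\pi_{\mathrm{ref}}\) is the pretrained LM, \(r_\phi\) is the learned reward model, and \(\beta\) controls how far the policy may drift from the language-modelling prior. That penalty alone pushes toward the prior dominating; the empirical claim above is stronger than what the formula by itself implies.)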
Another version of this argument is: “for many trajectories, an LM will primarily be predicting text, not executing RL-reinforced behaviors. Given this, the actions that get reinforced are likely to come from the LM producing text that scores highly under the reward model, rather than from random actions.”
This is actually a pretty good argument, and has caused me to update more strongly toward the view that we should be optimizing only the thought process of chain-of-thought language models, not the outcomes they produce.
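A minimal sketch of the distinction, with hypothetical `rate_step` / `rate_outcome` helpers standing in for whatever judge or reward model would actually do the scoring:

```python
# Sketch of outcome-based vs process-based reward assignment for a
# chain-of-thought trajectory. `rate_outcome` and `rate_step` are
# hypothetical judges, not a real API.

from typing import Callable, List

def outcome_based_rewards(thoughts: List[str], outcome: str,
                          rate_outcome: Callable[[str], float]) -> List[float]:
    # Every reasoning step inherits the score of the final outcome, so a
    # chain of thought that happens to precede a reward-hacked outcome
    # gets reinforced along with it.
    r = rate_outcome(outcome)
    return [r] * (len(thoughts) + 1)

def process_based_rewards(thoughts: List[str],
                          rate_step: Callable[[str], float]) -> List[float]:
    # Each reasoning step is scored on its own merits; the outcome itself
    # is never optimized against.
    return [rate_step(step) for step in thoughts]
```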
Also, I think if you trained something to predict text and then RL-trained it on inclusive genetic fitness as a human (or on human motivation signals), its learning would mostly be in the space of “select a specific human / subdistribution of humans to imitate” rather than learning behaviors specific to the task, and its generalization properties would then depend more on those humans than on the specific training setup used.
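As a toy illustration of what “learning mostly selects which humans to imitate” could look like: below, the per-persona behaviour distributions are frozen (standing in for what text prediction learned) and RL on a made-up fitness-like reward only retrains the mixture over personas. All persona names, actions, and numbers are invented.

```python
# Toy illustration: the per-persona behaviour distributions below are frozen
# (standing in for what text prediction learned); RL on a made-up
# fitness-like reward only retrains the mixture over personas. All names
# and numbers are invented.

import math

def softmax(logits):
    z = sum(math.exp(v) for v in logits.values())
    return {k: math.exp(v) / z for k, v in logits.items()}

# Frozen action distributions for a few hypothetical humans to imitate.
personas = {
    "ascetic":   {"have_kids": 0.1, "gain_status": 0.3, "other": 0.6},
    "careerist": {"have_kids": 0.3, "gain_status": 0.6, "other": 0.1},
    "dynast":    {"have_kids": 0.8, "gain_status": 0.1, "other": 0.1},
}

# Hypothetical fitness-like reward per action.
reward = {"have_kids": 1.0, "gain_status": 0.3, "other": 0.0}

def expected_reward(name):
    return sum(p * reward[a] for a, p in personas[name].items())

# Only the mixture logits are trainable; expected-gradient ascent moves
# mass toward whichever fixed persona happens to score best.
mix_logits = {name: 0.0 for name in personas}
lr = 0.5
for _ in range(200):
    w = softmax(mix_logits)
    avg = sum(w[name] * expected_reward(name) for name in w)
    for name in mix_logits:
        mix_logits[name] += lr * w[name] * (expected_reward(name) - avg)

print(softmax(mix_logits))  # mass concentrates on "dynast"; its behaviour never changed
```

Whatever the trained model then does out of distribution is driven by the fixed behaviour of the persona it selected, not by anything the RL signal taught about the task itself.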