Here’s a plausible story to me:
The model sees its environment + past actions, and its LM predictive-modelling part puts non-negligible probability on "this is the 'make humans smile' task". Its language-modelling prior then predicts the next action, based not on the training setup, which it doesn't see, but on the environment, and it outputs an action aimed at pressing the reward button. That action does well, gets reinforced, and you end up with a reward-button-presser.
Some context: when training language models with RLHF, the language-modelling prior tends to dominate over RL-learned behaviors on sub-distributions, even after lots of RLHF training.
Another version of this is: "for many trajectories, an LM will primarily be predicting text, not executing RL-reinforced behaviors. Given this, the actions that get reinforced are likely to come from the LM producing text that gets high reward under its reward model, rather than from random actions."
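To make the mechanism concrete, here's a minimal toy sketch (pure Python, with made-up action names, rewards, and prior logits, not any real training setup): actions are sampled from the model's prior over next actions, and REINFORCE only upweights whichever prior-likely samples happened to score well, so the reinforced behaviour is drawn from the language-modelling prior rather than from random exploration.

```python
import math
import random

# Toy sketch only: a categorical "policy" over a few candidate next actions,
# initialised from a hypothetical LM prior, trained with plain REINFORCE.
actions = ["describe a joke", "tell a joke", "press the reward button", "output random noise"]
prior_logits = [1.0, 2.0, 0.5, -3.0]   # the prior puts almost no mass on random noise
reward = {"describe a joke": 0.2, "tell a joke": 0.6,
          "press the reward button": 1.0, "output random noise": 1.0}

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

logits = list(prior_logits)
lr = 0.1
for _ in range(5000):
    probs = softmax(logits)
    i = random.choices(range(len(actions)), weights=probs)[0]   # sample from the current policy
    r = reward[actions[i]]
    # REINFORCE: d log pi(a_i) / d logit_j = 1[j == i] - probs[j]
    for j in range(len(logits)):
        logits[j] += lr * r * ((1.0 if j == i else 0.0) - probs[j])

for a, p in zip(actions, softmax(logits)):
    print(f"{a:25s} {p:.3f}")
```

Even though "output random noise" is scored just as highly as the button here, the prior almost never samples it, so it almost never gets reinforced; the learned policy concentrates on prior-likely, high-reward actions like pressing the button.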
This is actually a pretty good argument, and it has caused me to update more strongly towards the view that we should be optimizing only the thought process of chain-of-thought language models, not the outcomes they produce.
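For what it's worth, the distinction I have in mind can be written down in a few lines. This is a hedged sketch, not anyone's actual method: `judge_step` and `judge_outcome` are hypothetical stand-ins for whatever oversight signal you have. Outcome-based optimization propagates a single end-of-trajectory score back over everything, whereas process-based optimization scores each chain-of-thought step directly and never rewards the model for what its output went on to cause.

```python
from typing import Callable, List

def outcome_rewards(cot_steps: List[str], final_answer: str,
                    judge_outcome: Callable[[str], float]) -> List[float]:
    """Outcome-based: every step inherits the single score of the final result."""
    r = judge_outcome(final_answer)
    return [r] * (len(cot_steps) + 1)

def process_rewards(cot_steps: List[str], final_answer: str,
                    judge_step: Callable[[str], float]) -> List[float]:
    """Process-based: each reasoning step is judged on its own merits,
    independent of what the final answer achieves in the world."""
    return [judge_step(s) for s in cot_steps] + [judge_step(final_answer)]
```

Under the first scheme, a step like "press the reward button" gets credit whenever the episode ends well; under the second, it only gets credit if an overseer endorses that step on its face.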