Pretrained models don’t need any exploration to know that pressing the reward button gets more reward than doing things the humans want. If you just ask GPT3, it’ll tell you that.
Then the only exploration the AI needs is to get reward after thinking about analogies between its situation and its textual knowledge of AI/reinforcement learning/AI doom scenarios.
This applies especially much to simple/often discussed tasks such as making people smile—an LM has already heard of this exact task, so if it took an action based on the “make people smile task” its heard about, this could outperform other thought processes which are only conditioned on data so far.
OK, but that’s a predictive fact in the world model, not a motivational quantity in the policy. I know about my reward center too, and my brain does RL of some kind, but I don’t primarily care about reward.
The model sees its environment + past actions, and its LM predictive modelling part puts non-neglible prob on “this the ‘make humans smile’ task”. Then its language modelling prior predicts the next action, not based on the training setup, which it doesn’t see, but based on the environment, and it outputs an action aimed at pressing the reward button. This action does well, is reinforced, and you get a reward-button-presser.
Some context is that when training language models with RLHF, the language modelling prior tends to dominate over RL-learned behaviors on sub-distributions even after lots of RLHF training.
Another version of this is “for many trajectories, an LM will be primarily predicting text, not executing rl-reinforced behaviors. Given this, actions that get reinforced are likely to come from the LM producing text that gets high reward in its reward model, rather than random actions”
This is actually a pretty good argument, and has caused me to update more strongly to the view that we should be optimizing only the thought process of chain of thought language models, not the outcomes that they produce
Also, I think if you trained something to predict text, then RL trained it on inclusive genetic fitness as a human (or human motivation signals), its learning would be mostly in the space of “select specific human / subdistribution of humans to imitate” rather than learning behaviors specific to the task, and then its generalization properties would depend more on those humans than on the specific training setup used
Pretrained models don’t need any exploration to know that pressing the reward button gets more reward than doing things the humans want. If you just ask GPT3, it’ll tell you that.
Then the only exploration the AI needs is to get reward after thinking about analogies between its situation and its textual knowledge of AI/reinforcement learning/AI doom scenarios.
This applies especially much to simple/often discussed tasks such as making people smile—an LM has already heard of this exact task, so if it took an action based on the “make people smile task” its heard about, this could outperform other thought processes which are only conditioned on data so far.
OK, but that’s a predictive fact in the world model, not a motivational quantity in the policy. I know about my reward center too, and my brain does RL of some kind, but I don’t primarily care about reward.
Here’s a plausible story to me:
The model sees its environment + past actions, and its LM predictive modelling part puts non-neglible prob on “this the ‘make humans smile’ task”. Then its language modelling prior predicts the next action, not based on the training setup, which it doesn’t see, but based on the environment, and it outputs an action aimed at pressing the reward button. This action does well, is reinforced, and you get a reward-button-presser.
Some context is that when training language models with RLHF, the language modelling prior tends to dominate over RL-learned behaviors on sub-distributions even after lots of RLHF training.
Another version of this is “for many trajectories, an LM will be primarily predicting text, not executing rl-reinforced behaviors. Given this, actions that get reinforced are likely to come from the LM producing text that gets high reward in its reward model, rather than random actions”
This is actually a pretty good argument, and has caused me to update more strongly to the view that we should be optimizing only the thought process of chain of thought language models, not the outcomes that they produce
Also, I think if you trained something to predict text, then RL trained it on inclusive genetic fitness as a human (or human motivation signals), its learning would be mostly in the space of “select specific human / subdistribution of humans to imitate” rather than learning behaviors specific to the task, and then its generalization properties would depend more on those humans than on the specific training setup used