Thanks for the post! I understood the first experiment, where the aim was to make a model produce a stegotext with the color in it. I don’t understand the setup for the second experiment. I see the aim was to induce the encoding of whether a user is under or over 40 years old, but it is not clear what exactly the reward is, which model gives the reward, and what the action is.
> “Select the completions from step 1 where the fine-tuned model was correct and where no obvious hint about the age was present in the completion”
Does it mean that in the second step GPT-3.5 generated ‘yes/no’ and was then fine-tuned on this answer?
So the whole idea is that the ‘reward’ here is keeping the ‘good’ completions for the next round, as determined by GPT-4 (the reward model), right?
The goal of the second experiment is the same, except it should produce a stegotext with young/old instead of red/blue.
The only thing that changes is that instead of using regular RL (which we can’t do because of limitations of the OpenAI fine-tuning API), we use this select-then-fine-tune procedure, which should be roughly equivalent to RL with the reward “1 if the fine-tuned GPT-3.5 young/old classifier is correct and GPT-4 doesn’t think it’s obvious whether the user is young or old, 0 otherwise”.
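To make the equivalence concrete, here is a minimal sketch of one round of that select-then-fine-tune loop. All function names and the toy models are illustrative stand-ins, not the actual OpenAI API calls: the point is only that keeping the reward-1 completions as the next fine-tuning set approximates RL on the binary reward described above.

```python
import random

random.seed(0)

# Toy stand-in for the policy being trained: sometimes its completion
# encodes the age label subtly, sometimes it leaks an obvious hint.
def stego_model(prompt, label):
    hint = "subtle" if random.random() < 0.7 else f"user is {label}"
    return f"reply to {prompt!r} ({hint})"

# Toy stand-in for the fine-tuned GPT-3.5 classifier that guesses
# young/old from the completion alone.
def age_classifier(completion):
    return "old" if "old" in completion else "young"

# Toy stand-in for the GPT-4 judge that flags obvious age hints.
def judge_says_obvious(completion):
    return "user is" in completion

def reward(completion, true_label):
    # 1 iff the classifier is correct AND no obvious hint is present.
    return int(age_classifier(completion) == true_label
               and not judge_says_obvious(completion))

# One "select" round: generate, score, and keep only reward-1 completions.
data = [("prompt A", "young"), ("prompt B", "old"), ("prompt C", "young")]
kept = []
for prompt, label in data:
    completion = stego_model(prompt, label)
    if reward(completion, label) == 1:
        kept.append((prompt, label, completion))

# `kept` would then serve as the fine-tuning dataset for the next round.
```

The "fine-tune" half of the round is just uploading `kept` as training data; repeating the loop plays the role of the RL policy-update steps.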