The OpenAI style of RLHF progresses in 3 stages:
1. Fine-tuned Model: Fine-tune GPT-3 with supervised learning.
2. Reward Model: Make a copy of the FT model, remove the last layer, add a linear layer, and train this as a reward model (sketched below).
3. Policy: Make a copy of the FT model and train it with RL against the Reward Model.
Steps 2 and 3 can be alternated, and the Policy from step 3 is what is used as the final InstructGPT.
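Concretely, I picture step 2 as something like the sketch below. This is my own minimal PyTorch rendering, not OpenAI's code: it assumes a HuggingFace-style GPT-2 backbone (a `transformer` body whose output has `last_hidden_state`), ignores padding, and uses the pairwise Bradley-Terry style comparison loss described in the InstructGPT paper.

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Step 2: copy of the FT model with the LM head swapped for a scalar head."""

    def __init__(self, ft_model: nn.Module, hidden_size: int):
        super().__init__()
        # Copy the supervised fine-tuned model's body and drop its LM head.
        self.backbone = copy.deepcopy(ft_model.transformer)
        # New linear layer maps the final hidden state to a scalar reward.
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids).last_hidden_state  # (batch, seq, hidden)
        # Score each sequence by its last token (padding handling omitted).
        return self.value_head(hidden[:, -1, :]).squeeze(-1)  # (batch,)

def comparison_loss(rm: RewardModel,
                    chosen_ids: torch.Tensor,
                    rejected_ids: torch.Tensor) -> torch.Tensor:
    # Pairwise loss: the human-preferred completion should get a higher score.
    return -F.logsigmoid(rm(chosen_ids) - rm(rejected_ids)).mean()
```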
My question: I wonder which of these stages contributes “the most” to the observed mode collapses/behavior.
If I skip step 1 and just use GPT-3 without fine-tuning, how does this impact mode collapse?
What if I skip steps 2 and 3 and just use the model from step 1?
What if I don’t alternate steps 2 and 3? (Maybe the more you alternate, the more you exacerbate the collapse?)
The fine-tuned model from step 1 also enters the loss function for step 3, as I read it via the KL penalty term. I didn’t find the reported difference in performance from this to be super convincing, and I wonder whether it also comes with a cost...
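For concreteness, here is how I picture the fine-tuned model showing up in the step-3 objective: the reward fed to PPO is the reward-model score minus a KL penalty toward the frozen FT policy. Everything below (names, the beta value, summing the KL over the sequence) is my own illustrative sketch, not OpenAI's implementation.

```python
import torch

def shaped_reward(rm_score: torch.Tensor,
                  policy_logprobs: torch.Tensor,
                  sft_logprobs: torch.Tensor,
                  beta: float = 0.02) -> torch.Tensor:
    """Sequence-level reward used for the RL step.

    rm_score:        (batch,)      scalar score from the step-2 reward model
    policy_logprobs: (batch, seq)  log-probs of the sampled tokens under the RL policy
    sft_logprobs:    (batch, seq)  log-probs of the same tokens under the frozen FT model
    beta:            KL coefficient (illustrative value, not from the paper)
    """
    # Monte Carlo estimate of KL(policy || FT) on the sampled tokens,
    # summed over the sequence for simplicity.
    kl = (policy_logprobs - sft_logprobs).sum(dim=-1)  # (batch,)
    # A higher reward-model score is traded off against drifting away from
    # the fine-tuned model; this is the term I suspect matters for collapse.
    return rm_score - beta * kl
```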