Some links I think do a good job:
https://huggingface.co/blog/rlhf
https://openai.com/research/instruction-following
Thank you. I was completely missing that they used a second ‘preference’ model to score outputs for the RL. I’m surprised that works!