Thank you. I was completely missing that they used a second ‘preference’ model to score outputs for the RL. I’m surprised that works!