The PM is pretty bad (it’s trained on hh).
It’s actually only trained on the data after the first 20k of the ~156k datapoints in hh, which moves the mean reward-diff from 1.04 → 1.36 if you only calculate over that remaining ~136k subset.
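For concreteness, here’s a minimal sketch of that subset calculation, assuming you’ve already scored every hh comparison with the PM and saved the chosen/rejected rewards in dataset order (the file names and arrays here are hypothetical, not from the post):

```python
import numpy as np

# Hypothetical per-example PM scores over all ~156k hh comparisons,
# stored in the original dataset order.
rewards_chosen = np.load("pm_rewards_chosen.npy")      # shape: (~156k,)
rewards_rejected = np.load("pm_rewards_rejected.npy")  # shape: (~156k,)

reward_diff = rewards_chosen - rewards_rejected

# Mean over the full dataset vs. over the ~136k datapoints the PM was
# actually trained on (everything after the first 20k).
print("mean reward-diff, full hh:        ", reward_diff.mean())
print("mean reward-diff, after first 20k:", reward_diff[20_000:].mean())
```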
My understanding is there are 3 bad things:
1. The hh dataset is inconsistent
2. The PM doesn’t separate chosen vs rejected very well (as shown above)
3. The PM is GPT-J (a 6B parameter model), which doesn’t have the most complex features to choose from.
The in-distribution argument is most likely the explanation for the “Thank you. My pleasure” case, because the assistant never (AFAIK, I didn’t check) said that phrase as a response; it only said “My pleasure” after the user said “thank you”.
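This would be easy to check, for what it’s worth; a minimal sketch, assuming the public Anthropic/hh-rlhf release on Hugging Face (where each row stores the whole conversation as a single “chosen”/“rejected” string):

```python
from datasets import load_dataset

# Assumes the public Anthropic/hh-rlhf release, where each row stores the
# full conversation as one string with "\n\nHuman:" / "\n\nAssistant:" turns.
ds = load_dataset("Anthropic/hh-rlhf", split="train")

phrase = "Assistant: Thank you. My pleasure"
hits = [row for row in ds if phrase in row["chosen"] or phrase in row["rejected"]]
print(f"{len(hits)} transcripts contain that exact assistant turn")
```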