The PM is pretty bad (it’s trained on hh).
It’s actually only trained on the data after the first 20k of the ~156k datapoints in hh, which moves the mean reward-diff from 1.04 → 1.36 if you only calculate over that remaining ~136k subset.
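For concreteness, here’s a minimal sketch of that subset calculation, assuming you’ve already scored every hh comparison with the PM and saved the chosen/rejected rewards in dataset order (the file names and arrays here are hypothetical, not from the post):

```python
import numpy as np

# Hypothetical per-example PM scores over all ~156k hh comparisons,
# stored in the original dataset order.
rewards_chosen = np.load("pm_rewards_chosen.npy")      # shape: (~156k,)
rewards_rejected = np.load("pm_rewards_rejected.npy")  # shape: (~156k,)

reward_diff = rewards_chosen - rewards_rejected

# Mean over the full dataset vs. over the ~136k datapoints the PM was
# actually trained on (everything after the first 20k).
print("mean reward-diff, full hh:        ", reward_diff.mean())
print("mean reward-diff, after first 20k:", reward_diff[20_000:].mean())
```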
My understanding is there are 3 bad things:
1. The hh dataset is inconsistent
2. The PM doesn’t separate chosen vs rejected very well (as shown above)
3. The PM is GPT-J (a 6B parameter model), which doesn’t have the most complex features to choose from.
The in-distribution argument is most likely the explanation for the “Thank you. My pleasure” case, because the assistant never (AFAIK, I didn’t check) said that phrase as a response; it only said “My pleasure” after the user said “thank you”.
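This would be easy to check, for what it’s worth; a minimal sketch, assuming the public Anthropic/hh-rlhf release on Hugging Face (where each row stores the whole conversation as a single “chosen”/“rejected” string):

```python
from datasets import load_dataset

# Assumes the public Anthropic/hh-rlhf release, where each row stores the
# full conversation as one string with "\n\nHuman:" / "\n\nAssistant:" turns.
ds = load_dataset("Anthropic/hh-rlhf", split="train")

phrase = "Assistant: Thank you. My pleasure"
hits = [row for row in ds if phrase in row["chosen"] or phrase in row["rejected"]]
print(f"{len(hits)} transcripts contain that exact assistant turn")
```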