Neel Nanda comments on Interpreting Preference Models w/ Sparse Autoencoders

Neel Nanda 6 Jul 2024 2:47 UTC
LW: 4 AF: 4
I think this is a preference model trained on GPT-J, so my guess is that it’s just very dumb and learned lots of silly features. I’d be very surprised if a ChatGPT preference model had the same issues when an SAE is trained on it.