I think this is a preference model trained on GPT-J, so my guess is that it’s just very dumb and learned lots of silly features. I’d be very surprised if a ChatGPT preference model had the same issues when an SAE is trained on it.