What do you think of the features found? They seem to work, given your causal manipulations, but looking over them, they seem… very superficial. Like penalizing URLs per se doesn't seem like a great thing to have in a reward model. (An LLM has typically memorized or can guess lots of useful URLs.) It doesn't match my own 'preferences', as far as I can tell, and so is doing a bad job at 'preference learning'.
Is this an artifact of SAEs being able to express only simple linear features by design, with more complex features or computations elsewhere compensating for these initial simple, frugal heuristics to yield more appropriate responses, or is this kind of preference learning really just that dumb? (I'm reminded of the mode collapse issues like rhyming poetry. If you trained an SAE on the reward model for ChatGPT, would you find a feature as simple as 'rhyming = gud'?)
This is a preference model trained on GPT-J, I think, so my guess is that it's just very dumb and learned lots of silly features. I'd be very surprised if a ChatGPT preference model had the same issues when an SAE is trained on it.
Regarding URLs, I think this is a mix of the HH dataset being non-ideal & the PM not being a great discriminator of chosen vs. rejected reward (see nostalgebraist's comment & my response).
I do think SAEs find the relevant features, but inefficiently compressed (see Josh & Isaac's work on days-of-the-week circle features), so an ideal SAE (or alternative architecture) would not separate these features. Relatedly, many of the features that had high URL-relevant reward had above-random cos-sim with each other.
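One quick way to sanity-check that cos-sim observation, if you have the trained SAE, is to compare the mean pairwise cosine similarity of the relevant decoder directions against a random-feature baseline. A minimal sketch, assuming the decoder is an `(n_features, d_model)` matrix; `decoder` and `url_feature_ids` are placeholder names, not anything from the post:

```python
import torch
import torch.nn.functional as F

def mean_pairwise_cos_sim(directions: torch.Tensor) -> float:
    """Mean off-diagonal cosine similarity between rows of `directions`."""
    normed = F.normalize(directions, dim=-1)
    sims = normed @ normed.T
    n = sims.shape[0]
    off_diag = sims[~torch.eye(n, dtype=torch.bool)]
    return off_diag.mean().item()

# decoder: (n_features, d_model) SAE decoder directions (stand-in; load your trained SAE here)
decoder = torch.randn(16384, 4096)
url_feature_ids = [12, 345, 678]  # hypothetical indices of the URL-related features

url_sim = mean_pairwise_cos_sim(decoder[url_feature_ids])
rand_ids = torch.randint(0, decoder.shape[0], (len(url_feature_ids),))
baseline_sim = mean_pairwise_cos_sim(decoder[rand_ids])
print(f"URL features: {url_sim:.3f} vs random baseline: {baseline_sim:.3f}")
```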
[I also think the SAEs could be optimized to trade off some reconstruction loss for a reward-difference loss, which I expect would show a cleaner effect on the reward.]
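To make the bracketed idea concrete, here is one way that trade-off could look; this is my own hedged sketch, not training code from the post. The extra term penalizes the SAE when reconstructing the activations changes the chosen-vs-rejected reward gap, with the reward head frozen. `sae`, `reward_head`, `lam`, and `l1_coef` are all assumed names and interfaces:

```python
import torch
import torch.nn.functional as F

def sae_loss(sae, reward_head, acts_chosen, acts_rejected, lam=0.1, l1_coef=1e-3):
    # Standard SAE terms: reconstruction MSE + L1 sparsity on feature activations.
    # Assumes sae(x) returns (feature_activations, reconstruction).
    feats_c, recon_c = sae(acts_chosen)
    feats_r, recon_r = sae(acts_rejected)
    recon_loss = F.mse_loss(recon_c, acts_chosen) + F.mse_loss(recon_r, acts_rejected)
    sparsity = feats_c.abs().mean() + feats_r.abs().mean()

    # Reward-difference term: the chosen-vs-rejected reward gap computed from the
    # reconstructions should match the gap computed from the original activations.
    # The reward head is treated as fixed; gradients flow only through the reconstructions.
    with torch.no_grad():
        true_gap = reward_head(acts_chosen) - reward_head(acts_rejected)
    recon_gap = reward_head(recon_c) - reward_head(recon_r)
    reward_diff_loss = F.mse_loss(recon_gap, true_gap)

    return recon_loss + l1_coef * sparsity + lam * reward_diff_loss
```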