Regarding URLs, I think this is a mix of the HH dataset being non-ideal & the PM not being a great discriminator of chosen vs. rejected reward (see nostalgebraist’s comment & my response).
I do think SAEs find the relevant features, but in an inefficiently compressed form (see Josh & Isaac’s work on days-of-the-week circle features). So an ideal SAE (or alternative architecture) would not split these into separate features. Relatedly, many of the features that had high URL-relevant reward had above-random cos-sim with each other.
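A rough sketch of what an "above-random cos-sim" check could look like, using only the Python standard library. This is not the code behind the claim above: the toy `url_features` vectors, the helper names, and the choice of Gaussian random directions as the baseline are all assumptions made for illustration. Comparing against randomly chosen other SAE decoder directions would probably be a more faithful baseline.

```python
import math
import random


def cos_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cos_sim(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    sims = [
        cos_sim(vectors[i], vectors[j])
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    ]
    return sum(sims) / len(sims)


def random_baseline(n_vectors, dim, n_trials=200, seed=0):
    """Mean pairwise cos-sim for sets of Gaussian random directions (assumed baseline)."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_trials):
        vecs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_vectors)]
        means.append(mean_pairwise_cos_sim(vecs))
    return means


# Toy stand-in for decoder directions of URL-related features:
# a shared direction plus independent noise (hypothetical data).
rng = random.Random(1)
dim = 64
shared = [rng.gauss(0.0, 1.0) for _ in range(dim)]
url_features = [[s + 0.5 * rng.gauss(0.0, 1.0) for s in shared] for _ in range(5)]

observed = mean_pairwise_cos_sim(url_features)
baseline = random_baseline(len(url_features), dim)

# Fraction of random trials whose mean cos-sim falls below the observed value.
frac_below = sum(b < observed for b in baseline) / len(baseline)
print(f"observed mean cos-sim: {observed:.3f}")
print(f"fraction of random baselines below observed: {frac_below:.2f}")
```

With real features, `url_features` would be replaced by the decoder rows of the features in question; a fraction near 1 would be consistent with those features sharing a direction rather than being independent.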
[I also think the SAEs could be optimized to trade off some reconstruction loss for reward-difference loss, which I expect would show a cleaner effect on the reward.]
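One way the trade-off might be written down, as a minimal sketch. It assumes "reward-difference loss" means the squared gap between the PM's reward on the original activations and its reward when the SAE reconstruction is patched in; that reading, the function names, the `reward_weight` coefficient, and the toy numbers are all hypothetical. The usual sparsity penalty is left out for brevity.

```python
def mse(xs, ys):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def combined_sae_loss(acts, recon, reward_orig, reward_recon, reward_weight=0.1):
    """Reconstruction loss plus a weighted reward-difference term.

    reward_orig:  PM reward with the original activations.
    reward_recon: PM reward with the SAE reconstruction patched in.
    reward_weight is a hypothetical hyperparameter controlling the trade-off.
    """
    recon_loss = mse(acts, recon)
    reward_diff_loss = (reward_orig - reward_recon) ** 2
    return recon_loss + reward_weight * reward_diff_loss


# Toy comparison of two candidate reconstructions (made-up numbers).
acts = [1.0, 2.0, 3.0]
reward_orig = 0.5

# Candidate A: lower reconstruction error, but shifts the reward a lot.
recon_a, reward_a = [1.0, 2.0, 2.9], 0.9
# Candidate B: slightly worse reconstruction, reward nearly preserved.
recon_b, reward_b = [1.1, 2.1, 3.1], 0.52

loss_a = combined_sae_loss(acts, recon_a, reward_orig, reward_a)
loss_b = combined_sae_loss(acts, recon_b, reward_orig, reward_b)
print(f"candidate A loss: {loss_a:.5f}")
print(f"candidate B loss: {loss_b:.5f}")
```

With `reward_weight=0` the objective reduces to plain reconstruction and prefers candidate A; with a positive weight it can prefer B, which is the sense in which some reconstruction quality is traded for preserving the reward.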