Logan Riggs comments on Interpreting Preference Models w/ Sparse Autoencoders

Logan Riggs 5 Jul 2024 18:30 UTC
LW: 2 AF: 1
Regarding URLs, I think this is a mix of the HH dataset being non-ideal and the PM not being a great discriminator of chosen vs. rejected reward (see nostalgebraist’s comment and my response).
I do think SAEs find the relevant features, but in an inefficiently compressed form (see Josh & Isaac’s work on days-of-the-week circle features). So an ideal SAE (or alternative architecture) would not split these into separate features. Relatedly, many of the features with high URL-relevant reward had above-random cosine similarity with each other.
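To make the "above-random cos-sim" check concrete, here is a minimal sketch in plain Python. Everything in it is illustrative: the vectors are synthetic stand-ins for SAE decoder directions (not the actual features from the post), and `cosine` and `mean_pairwise_cosine` are hypothetical helper names. It compares the mean pairwise cosine similarity of a group of directions that share a common component against a group of random directions of the same size.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    sims = [
        cosine(vectors[i], vectors[j])
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    ]
    return sum(sims) / len(sims)


rng = random.Random(0)
dim = 64  # assumed toy dimension, far smaller than a real residual stream

# Synthetic "related" directions: one shared component plus independent noise.
shared = [rng.gauss(0, 1) for _ in range(dim)]
related = [[s + 0.5 * rng.gauss(0, 1) for s in shared] for _ in range(8)]

# Random baseline: independent Gaussian directions of the same count and size.
baseline = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(8)]

related_sim = mean_pairwise_cosine(related)
baseline_sim = mean_pairwise_cosine(baseline)
print(f"related: {related_sim:.3f}, random baseline: {baseline_sim:.3f}")
```

With real features, `related` would be replaced by the decoder rows of the URL-relevant features and `baseline` by randomly sampled decoder rows from the same SAE.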
[I also think the SAEs could be optimized to trade off some reconstruction loss for a reward-difference loss, which I expect would show a cleaner effect on the reward.]
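One way that trade-off could be written down, purely as a sketch: the exact form is an assumption, not something specified here. Let $x$ be the activation, $\hat{x}$ the SAE reconstruction, $r(\cdot)$ the PM's reward when the given activation is patched in, and $\lambda \ge 0$ a hypothetical weight on the reward term:

$$\mathcal{L} = \lVert x - \hat{x} \rVert_2^2 + \lambda \, \bigl( r(x) - r(\hat{x}) \bigr)^2$$

Setting $\lambda = 0$ recovers plain reconstruction loss (any sparsity penalty would be added as usual), and larger $\lambda$ gives up some reconstruction accuracy to preserve the reward.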