gwern comments on Interpreting Preference Models w/ Sparse Autoencoders