Neel Nanda comments on Interpreting Preference Models w/​ Sparse Autoencoders