Bogdan Ionut Cirstea comments on SAE features for refusal and sycophancy steering vectors

Bogdan Ionut Cirstea 12 Oct 2024 21:39 UTC
2 points
Seems a bit in the spirit of what’s called ‘high-level cross-validation’ here, between SAEs + auto-interp on the one hand, and activation engineering on the other.
- Jaehyuk Lim 13 Oct 2024 12:58 UTC
  3 points
  Although not “circuit-style,” this could also be considered one of the attempts outlined by Mack et al. (2024):
  https://www.lesswrong.com/posts/ioPnHKFyy4Cw2Gr2x/#:~:text=Unsupervised%20steering%20as,more%20distributed%20circuits.