RGRGRG comments on Mechanistically Eliciting Latent Behaviors in Language Models

RGRGRG 4 May 2024 18:44 UTC
1 point
0
Enjoyed this post! Quick question about obtaining the steering vectors:
Do you train them one at a time, possibly adding an orthogonality constraint against the previously trained vectors on each new run?
- Andrew Mack 6 May 2024 5:24 UTC
  2 points
  0
  Yes, I train them one at a time, constraining each new vector to be orthogonal to the older ones (this was not clear in the post, so thanks for asking!).
  I haven’t experimented with this, but you could also imagine using only “soft” orthogonality constraints (e.g., penalizing pairwise cosine similarities between vectors).
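
  To make the two options in this reply concrete, here is a minimal sketch, not the post's actual implementation. It uses plain Python lists instead of tensors to stay dependency-free. The names `dot`, `project_out`, and `cosine_penalty` are hypothetical, and squaring the cosine in the penalty is an assumption (the reply only says "penalizing pairwise cosine similarities").

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out(v, basis):
    """Hard constraint: remove from v its component along each basis vector.

    Assumes the basis vectors are nonzero and mutually orthogonal, which
    holds if each earlier vector was itself kept orthogonal to its
    predecessors.
    """
    out = list(v)
    for b in basis:
        coeff = dot(out, b) / dot(b, b)
        out = [x - coeff * y for x, y in zip(out, b)]
    return out


def cosine_penalty(vectors):
    """Soft constraint: sum of squared pairwise cosine similarities.

    The result is zero exactly when all vectors are mutually orthogonal.
    Squaring is an illustrative choice, not something stated in the thread.
    """
    total = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            u, v = vectors[i], vectors[j]
            cos = dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
            total += cos ** 2
    return total
```

  In a training loop, the hard version would plausibly call `project_out(new_vector, older_vectors)` after each update step, while the soft version would add a weighted `cosine_penalty(...)` term to the objective. The weight is a hyperparameter that this thread does not specify.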