Andrew Mack comments on Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack 6 May 2024 5:24 UTC
Yes, I train them one at a time, constraining each new vector to be orthogonal to the older ones (this was not clear in the post, so thanks for asking!).
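For readers who want something concrete, here is a minimal NumPy sketch of one way to enforce a hard constraint like this: after each update, project the new vector onto the orthogonal complement of the earlier ones and rescale it. The helper `project_orthogonal`, the fixed `radius`, and the random stand-in for training are all assumptions made for illustration, and the post's actual code may enforce the constraint differently.

```python
import numpy as np

def project_orthogonal(v, previous):
    """Remove from v its component in the span of the vectors in `previous`."""
    if len(previous) == 0:
        return v
    # Orthonormal basis for the span of the earlier vectors (columns of q).
    q, _ = np.linalg.qr(np.asarray(previous).T)  # q has shape (d, k)
    return v - q @ (q.T @ v)

rng = np.random.default_rng(0)
d, n_vectors, radius = 16, 4, 1.0  # hypothetical sizes, not from the post

learned = []
for _ in range(n_vectors):
    v = rng.normal(size=d)
    # Stand-in for training: a real loop would take gradient steps on the
    # objective and re-apply the projection and rescaling after each step.
    v = project_orthogonal(v, learned)
    v = radius * v / np.linalg.norm(v)
    learned.append(v)

# Gram matrix of the learned vectors; off-diagonal entries should be ~0.
gram = np.stack(learned) @ np.stack(learned).T
```

Because the projection is re-applied after every step in a real training loop, each new vector stays exactly orthogonal (up to floating-point error) to all earlier ones throughout optimization.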
You could also imagine using only “soft” orthogonality constraints (e.g., penalizing pairwise cosine similarities between vectors), though I haven’t experimented with this.
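As a rough illustration of what a soft constraint could look like, the sketch below computes a penalty from the pairwise cosine similarities of a set of vectors. The function name `cosine_penalty` and the choice to square the similarities (rather than, say, take absolute values) are assumptions, not something specified in the post.

```python
import numpy as np

def cosine_penalty(vectors):
    """Sum of squared pairwise cosine similarities between rows of `vectors`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    cos = unit @ unit.T
    # Take each unordered pair once: entries strictly above the diagonal.
    off_diag = cos[np.triu_indices(len(vectors), k=1)]
    return float(np.sum(off_diag ** 2))
```

In training, a term like this would be added to the loss with some weight (a hypothetical `lam * cosine_penalty(V)`), computed in an autodiff framework so that gradients flow through it. Unlike the hard constraint, this only discourages overlap between vectors and does not guarantee exact orthogonality.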