Andrew Mack comments on Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack 3 May 2024 4:25 UTC
Thanks for pointing me to these references, particularly NoiseCLR, which I was previously unaware of. I think those sorts of ideas will be very useful when trying to learn interesting vectors on a larger dataset of prompts. In particular, from skimming that paper, it looks like the numerator of equation (5) (which defines their contrastive learning objective) basically captures what I meant above when I suggested “one could maximize the cosine similarity between the differences in target activations across multiple prompts”. The fact that it seems to work so well in diffusion models gives me hope that it will also work in LLMs! My guess is that ultimately you get the most mileage out of combining the two objectives.
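To make the quoted suggestion concrete, here is a minimal sketch of one way such an objective could be scored. It is an illustration under stated assumptions, not the objective from the original post and not the NoiseCLR loss: it assumes you already have target-layer activations for each prompt with and without the steering vector applied, as plain lists of floats, and that no difference vector is all zeros. The names `cosine` and `consistency_objective` are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def consistency_objective(steered, unsteered):
    """Mean pairwise cosine similarity of per-prompt activation differences.

    steered[i] and unsteered[i] are the (assumed) target-layer activations
    for prompt i with and without the steering vector. Needs >= 2 prompts.
    """
    # Per-prompt change in target activations caused by steering.
    diffs = [
        [s - u for s, u in zip(s_row, u_row)]
        for s_row, u_row in zip(steered, unsteered)
    ]
    # Average similarity over every distinct pair of prompts.
    pairs = list(combinations(diffs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```

A score near 1 means the vector shifts the target activations in roughly the same direction on every prompt; a score near 0 means the shifts are unrelated across prompts. In practice this quantity would be maximized with respect to the steering vector in an autodiff framework, which this sketch leaves out.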
- Bogdan Ionut Cirstea 4 May 2024 9:50 UTC
  TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space seems to use a contrastive approach for steering vectors (I’ve only skimmed it, though), so it might be worth a look.