Bogdan Ionut Cirstea comments on Mechanistically Eliciting Latent Behaviors in Language Models

Bogdan Ionut Cirstea 2 May 2024 17:35 UTC
"Unsupervised Feature Detection: There is a rich literature on unsupervised feature detection in neural networks."
It might be interesting to add (some of) the literature on unsupervised feature detection in GANs and in diffusion models (e.g., see recent work from Pinar Yanardag and follow the citation trails).
Relatedly, I wonder whether using something like a contrastive loss (as in NoiseCLR or LatentCLR), instead of or in addition to the L2 distance, might produce interesting or different results.
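
To make the contrastive idea concrete, here is a minimal InfoNCE-style sketch in the spirit of LatentCLR, adapted to steering vectors. It is an illustration under stated assumptions, not the exact loss from either paper: `deltas[k][i]` is a hypothetical layout holding the change in target activations that steering vector `k` causes on prompt `i`, and the `temperature` value is arbitrary. Changes caused by the same vector on different prompts are treated as positives, and changes caused by other vectors as negatives.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon avoids division by zero


def contrastive_loss(deltas, temperature=0.5):
    """InfoNCE-style loss over activation changes (illustrative sketch).

    deltas[k][i] is assumed to be the change in target activations caused
    by steering vector k on prompt i. Requires at least two prompts.
    """
    num_vectors, num_prompts = len(deltas), len(deltas[0])
    total = 0.0
    for k in range(num_vectors):
        for i in range(num_prompts):
            anchor = deltas[k][i]
            # Positives: same steering vector, different prompt.
            pos = sum(
                math.exp(cosine(anchor, deltas[k][j]) / temperature)
                for j in range(num_prompts)
                if j != i
            )
            # Negatives: any other steering vector, any prompt.
            neg = sum(
                math.exp(cosine(anchor, deltas[l][j]) / temperature)
                for l in range(num_vectors)
                if l != k
                for j in range(num_prompts)
            )
            total += -math.log(pos / (pos + neg))
    return total / (num_vectors * num_prompts)


# Toy data: two steering vectors, two prompts, 2-dimensional activations.
consistent = [[[1.0, 0.0], [2.0, 0.0]], [[0.0, 1.0], [0.0, 3.0]]]
mixed = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 2.0], [3.0, 0.0]]]
```

On the toy data, `contrastive_loss(consistent)` is lower than `contrastive_loss(mixed)`: the loss rewards vectors whose effect points in the same direction across prompts and differs from the effect of the other vectors.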
- Andrew Mack 3 May 2024 4:25 UTC
  Thanks for pointing me to these references, particularly NoiseCLR, which I was previously unaware of! I think those sorts of ideas will be very useful when trying to learn interesting vectors on a larger dataset of prompts. In particular, from skimming that paper, it looks like the numerator of equation (5) (which defines their contrastive learning objective) basically captures what I meant above when I suggested “one could maximize the cosine similarity between the differences in target activations across multiple prompts”. The fact that it seems to work so well in diffusion models gives me hope that it will also work in LLMs! My guess is that ultimately you get the most mileage out of combining the two objectives.
  - Bogdan Ionut Cirstea 4 May 2024 9:50 UTC
    TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space seems to use a contrastive approach for steering vectors (though I’ve only skimmed it), so it might be worth a look.
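
The suggestion earlier in this thread, to maximize the cosine similarity between the differences in target activations across multiple prompts and to combine that with the L2 objective, can be sketched roughly as follows. This is a toy illustration, not the objective from the post or from any of the papers mentioned: `deltas`, the `weight` parameter, and the simple additive combination are all assumptions made for the example.

```python
import math
from itertools import combinations


def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (l2_norm(u) * l2_norm(v) + 1e-12)  # epsilon avoids 0/0


def combined_objective(deltas, weight=1.0):
    """Score one steering vector across prompts (to be maximized).

    deltas[i] is assumed to be the change in target activations that the
    vector causes on prompt i. Requires at least two prompts.
    `weight` is a hypothetical trade-off coefficient.
    """
    # Magnitude term: average L2 size of the activation change.
    magnitude = sum(l2_norm(d) for d in deltas) / len(deltas)
    # Consistency term: mean pairwise cosine similarity across prompts.
    pairs = list(combinations(deltas, 2))
    consistency = sum(cosine(u, v) for u, v in pairs) / len(pairs)
    return magnitude + weight * consistency


# Same magnitude of change in both cases, but only the first is consistent.
aligned = [[1.0, 0.0], [1.0, 0.0]]
opposed = [[1.0, 0.0], [-1.0, 0.0]]
```

Here `aligned` and `opposed` produce equally large activation changes, so an L2-only objective cannot tell them apart; the cosine term is what separates a vector with a consistent effect across prompts from one with an inconsistent effect.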