Jannes Elstner comments on Mechanistically Eliciting Latent Behaviors in Language Models

Jannes Elstner 9 May 2024 20:17 UTC
1 point
It would also be interesting to apply MELBO to language models that have already been trained with LAT (latent adversarial training). Adversarial attacks on vision models look significantly more meaningful to humans when the model has been adversarially trained, and since MELBO is basically a latent adversarial attack, we should be able to elicit more meaningful behaviors from language models trained with LAT.
What links here?
- Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models by Andrew Mack (3 Dec 2024 21:19 UTC; 83 points)