Bogdan Ionut Cirstea comments on Refusal in LLMs is mediated by a single direction

Bogdan Ionut Cirstea 29 Apr 2024 13:14 UTC
2 points
You might be interested in Concept Algebra for (Score-Based) Text-Controlled Generative Models, which both uses a somewhat similar empirical methodology for its concept editing and provides theoretical reasons to expect the linear representation hypothesis to hold. (I’d also interpret the findings here, and those from other recent works like Anthropic’s sleeper probes, as evidence for the linear representation hypothesis more broadly.)
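
For concreteness, here is a toy sketch of what "a concept is represented linearly" could mean in practice: estimate a direction as the difference of mean activations between inputs with and without the concept, then edit an activation by projecting that direction out. Everything here is an assumption for illustration (the 3-d "activations" are made up, and the helper names `concept_direction` and `ablate` are hypothetical); it is not meant as the exact procedure of either the post or the Concept Algebra paper.

```python
# Toy illustration only: synthetic 3-d "activations", not real model data.

def mean(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Rescale v to unit Euclidean norm."""
    norm = dot(v, v) ** 0.5
    return [x / norm for x in v]

def concept_direction(with_concept, without_concept):
    """Unit vector along the difference of the two group means."""
    mu_pos, mu_neg = mean(with_concept), mean(without_concept)
    return unit([p - n for p, n in zip(mu_pos, mu_neg)])

def ablate(x, direction):
    """Remove the component of x along a unit-norm direction."""
    coeff = dot(x, direction)
    return [xi - coeff * di for xi, di in zip(x, direction)]

# Hypothetical activations: the concept shifts the first coordinate by about +2.
with_concept = [[2.1, 0.3, -0.5], [1.9, -0.2, 0.4], [2.0, 0.1, 0.1]]
without_concept = [[0.1, 0.3, -0.4], [-0.1, -0.2, 0.5], [0.0, 0.1, -0.1]]

d = concept_direction(with_concept, without_concept)
edited = ablate(with_concept[0], d)

print(d)               # close to the first coordinate axis
print(dot(edited, d))  # approximately 0 after ablation
```

If the hypothesis holds for a given concept, an edit like `ablate` should remove that concept while leaving the components orthogonal to `d` untouched; whether that happens in a real model is an empirical question.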