Neel Nanda comments on Refusal in LLMs is mediated by a single direction

Neel Nanda 23 May 2024 15:01 UTC
LW: 5 AF: 3
Thanks! Note that this work uses steering vectors, not SAEs, so the technique is actually really easy and cheap—I actively think this is one of the main selling points (you can jailbreak a 70B model in minutes, without any finetuning or optimisation). I am excited at the idea of seeing if you can improve it with SAEs though—it’s not obvious to me that SAEs are better than steering vectors, though it’s plausible.

I may take you up on the two-hour offer, thanks! I’ll ask my co-authors.
- Buck 23 May 2024 16:04 UTC
  LW: 7 AF: 6
  Ugh, I’m a dumbass and forgot what we were talking about, sorry. Also excited for you to demonstrate that the steering vectors beat baselines here (I think it’s pretty likely you succeed).