Buck comments on Refusal in LLMs is mediated by a single direction

Buck 23 May 2024 14:54 UTC
LW: 6 AF: 4
0
AF
If it did better than SOTA under the same assumptions, that would be cool and I’m inclined to declare you a winner. If you have to train SAEs with way more compute than typical anti-jailbreak techniques use, I feel a little iffy but I’m probably still going to like it.
Bonus points if, for whatever technique you end up using, you also test the most similar technique that doesn’t use SAEs.
I haven’t thought that much about how exactly to make these comparisons, and might change my mind.
I’m also happy to spend at least two hours advising on what would impress me here; feel free to use them as you will.
- Neel Nanda 23 May 2024 15:01 UTC
  LW: 5 AF: 3
  0
  AF Parent
  Thanks! Note that this work uses steering vectors, not SAEs, so the technique is actually really easy and cheap—I actively think this is one of the main selling points (you can jailbreak a 70B model in minutes, without any finetuning or optimisation). I am excited at the idea of seeing if you can improve it with SAEs though—it’s not obvious to me that SAEs are better than steering vectors, though it’s plausible.
  
  I may take you up on the two hours offer, thanks! I’ll ask my co-authors.
  - Buck 23 May 2024 16:04 UTC
    LW: 7 AF: 6
    4
    AF Parent
    Ugh, I’m a dumbass and forgot what we were talking about, sorry. I’m also excited for you to demonstrate that the steering vectors beat baselines here (I think it’s pretty likely you succeed).