Thanks! Note that this work uses steering vectors, not SAEs, so the technique is actually really easy and cheap—I actively think this is one of the main selling points (you can jailbreak a 70B model in minutes, without any finetuning or optimisation). I am excited at the idea of seeing if you can improve it with SAEs though—it’s not obvious to me that SAEs are better than steering vectors, though it’s plausible.
I may take you up on the two hours offer, thanks! I’ll ask my co-authors
Ugh I’m a dumbass and forgot what we were talking about sorry. Also excited for you demonstrating the steering vectors beat baselines here (I think it’s pretty likely you succeed).
Thanks! Note that this work uses steering vectors, not SAEs, so the technique is actually really easy and cheap—I actively think this is one of the main selling points (you can jailbreak a 70B model in minutes, without any finetuning or optimisation). I am excited at the idea of seeing if you can improve it with SAEs though—it’s not obvious to me that SAEs are better than steering vectors, though it’s plausible.
I may take you up on the two hours offer, thanks! I’ll ask my co-authors
Ugh I’m a dumbass and forgot what we were talking about sorry. Also excited for you demonstrating the steering vectors beat baselines here (I think it’s pretty likely you succeed).