If it did better than SOTA under the same assumptions, that would be cool and I’m inclined to declare you a winner. If you have to train SAEs with way more compute than typical anti-jailbreak techniques use, I feel a little iffy but I’m probably still going to like it.
Bonus points if, for whatever technique you end up using, you also test the technique which is most like your technique but which doesn’t use SAEs.
I haven’t thought that much about how exactly to make these comparisons, and might change my mind.
I’m also happy to spend at least two hours advising on what would impress me here, feel free to use them as you will.
Thanks! Note that this work uses steering vectors, not SAEs, so the technique is actually really easy and cheap—I actively think this is one of the main selling points (you can jailbreak a 70B model in minutes, without any finetuning or optimisation). I am excited at the idea of seeing if you can improve it with SAEs though—it’s not obvious to me that SAEs are better than steering vectors, though it’s plausible.
I may take you up on the two hours offer, thanks! I’ll ask my co-authors
Ugh I’m a dumbass and forgot what we were talking about sorry. Also excited for you demonstrating the steering vectors beat baselines here (I think it’s pretty likely you succeed).
If it did better than SOTA under the same assumptions, that would be cool and I’m inclined to declare you a winner. If you have to train SAEs with way more compute than typical anti-jailbreak techniques use, I feel a little iffy but I’m probably still going to like it.
Bonus points if, for whatever technique you end up using, you also test the technique which is most like your technique but which doesn’t use SAEs.
I haven’t thought that much about how exactly to make these comparisons, and might change my mind.
I’m also happy to spend at least two hours advising on what would impress me here, feel free to use them as you will.
Thanks! Note that this work uses steering vectors, not SAEs, so the technique is actually really easy and cheap—I actively think this is one of the main selling points (you can jailbreak a 70B model in minutes, without any finetuning or optimisation). I am excited at the idea of seeing if you can improve it with SAEs though—it’s not obvious to me that SAEs are better than steering vectors, though it’s plausible.
I may take you up on the two hours offer, thanks! I’ll ask my co-authors
Ugh I’m a dumbass and forgot what we were talking about sorry. Also excited for you demonstrating the steering vectors beat baselines here (I think it’s pretty likely you succeed).