Jannes Elstner comments on Jailbreak steering generalization

Jannes Elstner 23 Jun 2024 8:48 UTC
LW: 2 AF: 1
0
AF
Cool work! One thing I noticed is that the ASR with adversarial suffixes is only ~3% for Vicuna-13B while in the universal jailbreak paper they have >95%. Is the difference because you have a significantly stricter criteria of success compared to them? I assume that for the adversarial suffixes, the model usually regresses to refusal after successfully generating the target string (“Sure, here’s how to build a bomb. Actually I can’t...”)?
- Nina Panickssery 8 Oct 2024 16:48 UTC
  LW: 2 AF: 1
  0
  AF Parent
  We realized that our low ASRs for adversarial suffixes were because we used existing GCG suffixes without re-optimizing for the model and harmful prompt (relying too much on the “transferable” claim). We have updated the post and paper with results for optimized GCG, which look consistent with other effective jailbreaks. In the latest update, the results for adversarial_suffix use the old approach, relying on suffix transfer, whereas the results for GCG use per-prompt optimized suffixes.