Nina Panickssery comments on Jailbreak steering generalization

Nina Panickssery 8 Oct 2024 16:48 UTC
LW: 2 AF: 1
0
AF
We realized that our low ASRs for adversarial suffixes were because we used existing GCG suffixes without re-optimizing for the model and harmful prompt (relying too much on the “transferable” claim). We have updated the post and paper with results for optimized GCG, which look consistent with other effective jailbreaks. In the latest update, the results for adversarial_suffix use the old approach, relying on suffix transfer, whereas the results for GCG use per-prompt optimized suffixes.