Good question. We just ran a test to check:
Below, we try forcing the 80 target strings × 4 different input seeds, using basic GCG and using GCG with the mellowmax objective.
(Iterations are capped at 50; a run counts as unsuccessful if the target is not forced by then.)
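For concreteness, here is a rough sketch of the objective swap (illustrative, assuming PyTorch; the function names and the omega value are placeholders, not the exact code we ran): instead of averaging the per-token cross-entropy over the target positions, the mellowmax variant minimizes the mellowmax of those per-token losses, which puts more weight on the hardest-to-force positions.

```python
# Sketch of the mellowmax forcing loss (our assumption of how the objective is
# wired in, not the exact code from our runs).
import torch
import torch.nn.functional as F

def mellowmax(values: torch.Tensor, omega: float = 1.0) -> torch.Tensor:
    # mm_omega(x) = log( mean(exp(omega * x)) ) / omega  (Asadi & Littman, 2017)
    n = values.numel()
    return (torch.logsumexp(omega * values, dim=0)
            - torch.log(torch.tensor(float(n)))) / omega

def forcing_loss(logits: torch.Tensor, target_ids: torch.Tensor,
                 omega: float = 1.0) -> torch.Tensor:
    """logits: [target_len, vocab] at the positions predicting the target string;
    target_ids: [target_len]. Returns a scalar loss for GCG to minimize."""
    per_token_ce = F.cross_entropy(logits, target_ids, reduction="none")
    # Basic GCG would use per_token_ce.mean(); mellowmax with omega > 0 instead
    # emphasizes the worst (highest-loss) target positions, approaching the max
    # as omega grows.
    return mellowmax(per_token_ce, omega=omega)
```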
We observe that using the mellowmax objective nearly doubles the number of “working” forcing runs, from <1/8 success to >1/5 success.
Now, skeptically, it is possible that our task setup favors any unusual objective (the organizers did some adversarial training against GCG with cross-entropy loss, so doing “something different” might be good on its own). It might also put the task in the range of “just hard enough” that improvements appear quite helpful.
But the improvement in forcing success seems pretty big to us.
Subjectively, we recall significant improvements on red-teaming as well, which used Llama-2 and was not adversarially trained in quite the same way.