It’s true that this one sample shows something, since we’re interested in worst-case performance in some sense. But what I’m interested in is the increase in attacker burden induced by a robustness method, which is hard to tell from a single sample, and I would phrase the takeaway differently from the post authors. It’s also easy to get false-positive jailbreaks IME: you think you jailbroke the model, but your method fails on requests that require detailed knowledge, like synthesizing fentanyl. I think getting clear takeaways here takes more effort (perhaps more than it’s worth, so I’m glad the authors put this out).