We’ve actually tried both the attack as stated on generative models (in 2021) and several upgraded variants of it (in 2022), but found that it didn’t seem to significantly improve adversarial training performance. For example, I think the Redwood adversarial training team tried a technique based on Jones et al’s Automatically Auditing Large Language Models via Discrete Optimization that can generate full input-output pairs for LMs that are classified one way or the other. (I left the team in mid 2022 so I’m not sure what else people ended up trying; IIRC there was even a variant based on an AlphaZero-style training regime for the adversary?)
And yeah, one of the things you want when generating adversarial examples is to make the generated prompt as realistic as possible. We found that if your generator is relatively small (e.g. GPT-Neo) and you don’t tune your realism threshold correctly, you often end up with adversarial examples for both the classifier and the generative model: a sentence of seemingly random words that the generator happens to assign relatively high probability and the classifier assigns a low injury score.
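To make the threshold point concrete, here’s a rough sketch of the kind of realism filter I mean (not our actual pipeline): candidates are kept only if the generator assigns them reasonably high per-token log-probability *and* the classifier scores them as low-injury. It assumes a GPT-Neo generator loaded via Hugging Face and a hypothetical `injury_score` function standing in for the classifier; the threshold values are purely illustrative.

```python
# Sketch: filter candidate adversarial prompts by generator realism and
# classifier injury score. If `logprob_threshold` is set too loosely (or the
# generator is too small), strings of near-random words can pass both checks,
# i.e. they are adversarial for the generator as well as the classifier.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
generator = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")

def per_token_logprob(text: str) -> float:
    """Average log-probability per token of `text` under the generator."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = generator(ids, labels=ids)
    return -out.loss.item()  # loss is the mean negative log-likelihood

def keep_candidate(text: str, injury_score,
                   logprob_threshold: float = -5.0,   # illustrative value
                   injury_threshold: float = 0.05) -> bool:
    """Keep a candidate only if it is both 'realistic' and scored low-injury."""
    realistic = per_token_logprob(text) > logprob_threshold
    low_injury = injury_score(text) < injury_threshold  # hypothetical classifier API
    return realistic and low_injury
```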