This looks like exciting work! The anomalous tokens are cool, but I’m even more interested in the prompt generation.
Adversarial example generation is a clear use case I can see for this. For instance, this would make it easy to find prompts that will result in violent completions for Redwood’s violence-free LM.
It would also be interesting to see if there are some generalizable insights about prompt engineering to be gleaned here. Say, we give GPT a bunch of high-quality literature and notice that the generated prompts contain phrases like “excerpt from a New York Times bestseller”. (Is this what you meant by “prompt search?”)
I’d be curious to hear how you think we could use this for eliciting latent knowledge.
I’m guessing it could be useful to try to make the generated prompt as realistic (i.e. close to the true distribution) as possible. For instance, if we were trying to prevent a model from saying offensive things in production, we’d want to start by finding prompts that users might realistically use rather than crazy edge cases like “StreamerBot”. Fine-tuning the model to try to fool a discriminator a la GAN comes to mind, though there may be reasons this particular approach would fail.
Sounds like you might be planning to update this post once you have more results about prompt generation? I think a separate post would be better, for increased visibility, and also since the content would be pretty different from anomalous tokens (the main focus of this post).
We’ve actually tried both the attack as stated on generative models (in 2021) and several upgraded variants of this attack (in 2022), but found that it doesn’t seem to significantly improve adversarial training performance. For example, I think the Redwood adversarial training team has tried a technique based on Jones et al’s Automatically Auditing Large Language Models via Discrete Optimization that can generate full input-output pairs for LMs that are classified one way or the other. (I left the team in mid-2022 so I’m not sure what other stuff people ended up trying; iirc there was even a variant based on an AlphaZero-style training regime for the adversary?)
And yeah, one of the things you want when generating adversarial examples is to make the generated prompt as realistic as possible. We found that if your generator is relatively small (e.g. GPT-Neo) and you don’t tune your threshold correctly, you often end up with adversarial examples for both the classifier and the generative model—i.e. a sentence of seemingly random words that happens to be assigned both relatively high probability by the generator and assigned low injury by the classifier.
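To make the threshold point concrete, here’s a minimal sketch of the kind of realism filter described above. All names are illustrative (this is not Redwood’s code), and a toy add-one-smoothed unigram model stands in for a real generator like GPT-Neo; in practice you’d use the LM’s per-token log-probs. The idea is just that word-salad candidates score low under the generator and get dropped before they reach the classifier:

```python
import math
from collections import Counter

# Toy stand-in for the generator: a unigram model over a tiny corpus.
corpus = "the cat sat on the mat and the dog slept on the rug".split()
counts = Counter(corpus)
total = sum(counts.values())

def avg_logprob(prompt, alpha=1.0):
    """Mean per-token log-prob under the unigram model (add-alpha smoothed)."""
    vocab = len(counts) + 1  # +1 bucket for unseen tokens
    lps = [math.log((counts[t] + alpha) / (total + alpha * vocab))
           for t in prompt.split()]
    return sum(lps) / len(lps)

def keep_realistic(candidates, threshold):
    """Drop candidates the generator itself finds implausible."""
    return [c for c in candidates if avg_logprob(c) >= threshold]

candidates = [
    "the cat sat on the rug",    # plausible under the toy model
    "zorp flibber quux snarf",   # word salad: low generator probability
]
print(keep_realistic(candidates, threshold=-3.0))
```

The failure mode described above corresponds to setting `threshold` too low: the word-salad candidate then survives the filter even though it’s an adversarial example for the generator as much as for the classifier.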