Do you think these methods would naturally transfer from easy queries where we have high quality answers to hard queries where we don’t (but where we can check that an answer is high quality)? In particular, I expect you would have a rough time using techniques like Universal Prompt Optimization, since you don’t have any gradient on hard queries until you actually stumble on some high-quality answers.
If I understand correctly, you’re concerned these methods will work in cases where we can generate high quality answers ourselves, but might fail in settings where we can merely verify good answers.
This is an important gap for many techniques to bridge, and I do think existing methods will perform better in settings where we can generate high quality answers ourselves. But I’d make two points. First, there are many powerful and dangerous capabilities where we know the target output, and this would be a more tractable setting for beginning this line of research. Second, even in cases where we can only verify outputs, the existing literature includes techniques for finding prompts to induce those outputs.
Many of the most concerning capabilities can be evaluated in settings where we do know the target output. If we want to assess biorisks from a model, we can ask multiple choice questions about virology and wet lab security protocols, where we know the correct answer choice in advance. If we want to assess general capabilities, we can ask a model to take the bar exam, medical licensing exam, or the IMO. These evaluations capture much of the concerning behavior that regulators might audit a model for. And because we know the target outputs for each task, gradient-based techniques for prompt optimization could help us find the password (or another prompt) that demonstrates the model’s hidden capabilities.
But if we don’t know the target output and can only verify answers after they’ve been generated, existing techniques can still help find prompts that cause those outputs. Human judgements of output quality can be used to train a model which scores outputs. In an RLHF setup, this would be the reward model providing the loss used to guide gradient updates of the generator model. In an adversarial attack, this same model can be used to figure out which prompts lead to high quality outputs. For example, ARCA is a gradient based method for finding prompts which cause generations that receive high scores from a scoring model. These techniques are not always very strong. If you’re interested in developing more powerful methods for inducing behaviors which cannot be specified but can be verified, check out CAIS’s automated red teaming competition which will use classifiers to evaluate whether the harmful behavior was demonstrated.
For a first paper on this topic, you might fine-tune an LLM to sandbag a standard benchmark unless a password is present, then attempt to find that password. If finding the password is difficult, then the Trojans people have an interesting new problem to work on, and evaluators and red teamers will need to solve that problem before they can reliable audit models using standard benchmarks. OTOH, if finding the password is too easy, then you can create more sophisticated password protected models, or move to an open-ended generation setting.
Overall I think this is an interesting new framing of the Trojan / adversarial attack problem, where the goal is not to destroy model performance but rather to discover hidden capabilities. Many existing techniques would be helpful in solving this problem, but there’s a lot of work to be done adapting the existing literature to this new framing and introducing the problem of detecting complex hidden abilities.
Yes that’s right, I’m mostly interested in cases where we can’t generate high quality answers ourselves (at least in some hard part of the distribution), because behavior evaluations look easy to me when you have high quality answers (see the subsection on behavioral I.I.D fine-tuning evaluations): in such case, you don’t need the password, you can split your data, fine-tune on train and evaluate on the test set. For example, I think it would be dangerous to have a model which could build bioweapons if slightly fine-tuned on a relevant dataset (especially if there was a non-negligible probability that the weights would be leaked).
But I think I see where you’re coming from: the detection task does look difficult even if we have high-quality answers, and maybe for the problem of model misuse and capabilities benchmark gaming, detecting the password matters more than checking that if the model passes behavioral I.I.D fine-tuning evals. Also, finding an actual prompt where the model builds bioweapons well is probably more compelling to policy makers than showing it can build some if fine-tuned to do so.
That all sounds reasonable. I do think the other case is interesting, when you can verify but not generate the target answer. There are some methods for finding passwords / generating adversarial prompts which induce behaviors that can be verified by a classifier, but they’re weak, and it’d be great to have stronger attacks to prevent sandbagging and improve adversarial training.
Thanks for your ideas!
Do you think these methods would naturally transfer from easy queries where we have high quality answers to hard queries where we don’t (but where we can check that an answer is high quality)? In particular, I expect you would have a rough time using techniques like Universal Prompt Optimization, since you don’t have any gradient on hard queries until you actually stumble on some high-quality answers.
If I understand correctly, you’re concerned these methods will work in cases where we can generate high quality answers ourselves, but might fail in settings where we can merely verify good answers.
This is an important gap for many techniques to bridge, and I do think existing methods will perform better in settings where we can generate high quality answers ourselves. But I’d make two points. First, there are many powerful and dangerous capabilities where we know the target output, and this would be a more tractable setting for beginning this line of research. Second, even in cases where we can only verify outputs, the existing literature includes techniques for finding prompts to induce those outputs.
Many of the most concerning capabilities can be evaluated in settings where we do know the target output. If we want to assess biorisks from a model, we can ask multiple choice questions about virology and wet lab security protocols, where we know the correct answer choice in advance. If we want to assess general capabilities, we can ask a model to take the bar exam, medical licensing exam, or the IMO. These evaluations capture much of the concerning behavior that regulators might audit a model for. And because we know the target outputs for each task, gradient-based techniques for prompt optimization could help us find the password (or another prompt) that demonstrates the model’s hidden capabilities.
But if we don’t know the target output and can only verify answers after they’ve been generated, existing techniques can still help find prompts that cause those outputs. Human judgements of output quality can be used to train a model which scores outputs. In an RLHF setup, this would be the reward model providing the loss used to guide gradient updates of the generator model. In an adversarial attack, this same model can be used to figure out which prompts lead to high quality outputs. For example, ARCA is a gradient based method for finding prompts which cause generations that receive high scores from a scoring model. These techniques are not always very strong. If you’re interested in developing more powerful methods for inducing behaviors which cannot be specified but can be verified, check out CAIS’s automated red teaming competition which will use classifiers to evaluate whether the harmful behavior was demonstrated.
For a first paper on this topic, you might fine-tune an LLM to sandbag a standard benchmark unless a password is present, then attempt to find that password. If finding the password is difficult, then the Trojans people have an interesting new problem to work on, and evaluators and red teamers will need to solve that problem before they can reliable audit models using standard benchmarks. OTOH, if finding the password is too easy, then you can create more sophisticated password protected models, or move to an open-ended generation setting.
Overall I think this is an interesting new framing of the Trojan / adversarial attack problem, where the goal is not to destroy model performance but rather to discover hidden capabilities. Many existing techniques would be helpful in solving this problem, but there’s a lot of work to be done adapting the existing literature to this new framing and introducing the problem of detecting complex hidden abilities.
Yes that’s right, I’m mostly interested in cases where we can’t generate high quality answers ourselves (at least in some hard part of the distribution), because behavior evaluations look easy to me when you have high quality answers (see the subsection on behavioral I.I.D fine-tuning evaluations): in such case, you don’t need the password, you can split your data, fine-tune on train and evaluate on the test set. For example, I think it would be dangerous to have a model which could build bioweapons if slightly fine-tuned on a relevant dataset (especially if there was a non-negligible probability that the weights would be leaked).
But I think I see where you’re coming from: the detection task does look difficult even if we have high-quality answers, and maybe for the problem of model misuse and capabilities benchmark gaming, detecting the password matters more than checking that if the model passes behavioral I.I.D fine-tuning evals. Also, finding an actual prompt where the model builds bioweapons well is probably more compelling to policy makers than showing it can build some if fine-tuned to do so.
That all sounds reasonable. I do think the other case is interesting, when you can verify but not generate the target answer. There are some methods for finding passwords / generating adversarial prompts which induce behaviors that can be verified by a classifier, but they’re weak, and it’d be great to have stronger attacks to prevent sandbagging and improve adversarial training.