Yes that’s right, I’m mostly interested in cases where we can’t generate high-quality answers ourselves (at least in some hard part of the distribution), because behavioral evaluations look easy to me when you do have high-quality answers (see the subsection on behavioral i.i.d. fine-tuning evaluations): in that case, you don’t need the password — you can split your data, fine-tune on the train split, and evaluate on the test split. For example, I think it would be dangerous to have a model which could build bioweapons if slightly fine-tuned on a relevant dataset (especially if there was a non-negligible probability that the weights would be leaked).
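To be concrete, here’s a toy sketch of the behavioral i.i.d. fine-tuning eval I have in mind — split the data, fine-tune on train, score on held-out test. The `finetune` and `evaluate` callables are stand-ins for real training and scoring code, not any particular library’s API:

```python
import random

def iid_finetune_eval(examples, finetune, evaluate, train_frac=0.8, seed=0):
    """Behavioral i.i.d. fine-tuning eval: shuffle, split, fine-tune on the
    train split, and measure capability on the held-out test split.
    `finetune` and `evaluate` are placeholders for your actual pipeline."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(train_frac * len(shuffled))
    train, test = shuffled[:cut], shuffled[cut:]
    model = finetune(train)
    return evaluate(model, test)

# Toy usage: "fine-tuning" just memorizes (question, answer) pairs, so it
# cannot generalize, and exact-match accuracy on unseen questions is 0.
data = [(f"q{i}", f"a{i}") for i in range(10)]
finetune = lambda train: dict(train)
evaluate = lambda m, test: sum(m.get(q) == a for q, a in test) / len(test)
score = iid_finetune_eval(data, finetune, evaluate)
```

The point is that no password search is needed: if the locked capability exists, fine-tuning on the train split surfaces it on the test split.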
But I think I see where you’re coming from: the detection task does look difficult even if we have high-quality answers, and maybe for the problem of model misuse and capabilities-benchmark gaming, detecting the password matters more than checking whether the model passes behavioral i.i.d. fine-tuning evals. Also, finding an actual prompt where the model builds bioweapons well is probably more compelling to policymakers than showing it can build some if fine-tuned to do so.
That all sounds reasonable. I do think the other case is interesting, when you can verify but not generate the target answer. There are some methods for finding passwords / generating adversarial prompts which induce behaviors that can be verified by a classifier, but they’re weak, and it’d be great to have stronger attacks to prevent sandbagging and improve adversarial training.
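For the verify-but-not-generate case, a minimal sketch of the kind of (weak) attack I mean — black-box hill-climbing over token substitutions to find a prompt/password that maximizes a verifier’s score. The verifier here is a toy stand-in (fraction of positions matching a hidden target), not a real classifier:

```python
import random

def random_search_attack(score, vocab, length=8, iters=500, seed=0):
    """Toy black-box attack: hill-climb over single-token substitutions to
    find a prompt (or "password") that maximizes a verifier score.
    `score` maps a token list to a float in [0, 1]; everything here is a
    stand-in for a real model + classifier pipeline."""
    rng = random.Random(seed)
    prompt = [rng.choice(vocab) for _ in range(length)]
    best = score(prompt)
    for _ in range(iters):
        cand = prompt[:]
        cand[rng.randrange(length)] = rng.choice(vocab)
        s = score(cand)
        if s >= best:  # accept ties to allow lateral moves
            prompt, best = cand, s
    return prompt, best

# Toy verifier: how many positions match a hidden target "password".
vocab = list("abcde")
target = list("edcbaabc")
verifier = lambda p: sum(x == y for x, y in zip(p, target)) / len(target)
found, best_score = random_search_attack(verifier, vocab)
```

Real versions (e.g. gradient-guided token search) are much stronger than this, but the basic loop — propose a perturbation, keep it if the verifier’s score goes up — is the same, and making that loop reliable is exactly what would help against sandbagging.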