That all sounds reasonable. I do think the other case is interesting: when you can verify the target answer but not generate it. There are some methods for finding passwords / generating adversarial prompts that induce behaviors a classifier can verify, but they're weak, and it'd be great to have stronger attacks to prevent sandbagging and improve adversarial training.
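For concreteness, here's a minimal sketch of that verify-but-not-generate loop: a random-search attack that mutates a candidate prompt and keeps any mutation the verifier doesn't penalize. Everything here is a toy stand-in (the verifier, the vocabulary, the target behavior); a real attack would score candidates with a learned classifier over model outputs rather than this hand-written function.

```python
import random
import string

# Toy vocabulary of tokens the attack can place at each position.
VOCAB = list(string.ascii_lowercase) + [" "]

def verifier_score(prompt: str) -> float:
    """Stand-in verifier (hypothetical): returns a score in [0, 1] for how
    strongly the prompt elicits the target behavior. We can *evaluate* this
    on any candidate, but we can't invert it to generate the answer."""
    target = "open sesame"
    return sum(a == b for a, b in zip(prompt, target)) / len(target)

def random_search_attack(length: int = 11, steps: int = 5000) -> str:
    """Hill-climbing random search: mutate one position at a time and keep
    the mutation whenever the verifier score doesn't decrease."""
    prompt = random.choices(VOCAB, k=length)
    best = verifier_score("".join(prompt))
    for _ in range(steps):
        i = random.randrange(length)
        old = prompt[i]
        prompt[i] = random.choice(VOCAB)
        score = verifier_score("".join(prompt))
        if score >= best:
            best = score       # keep the mutation
        else:
            prompt[i] = old    # revert the mutation
    return "".join(prompt)

if __name__ == "__main__":
    found = random_search_attack()
    print(found, verifier_score(found))
```

The toy verifier gives a smooth signal, so random search succeeds easily; the hard part in practice is exactly that real classifiers are closer to 0/1 verdicts, which is why these attacks stay weak.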