Paying people money to solve things when you don’t employ them is sufficiently frowned upon in society that I’m not that surprised it isn’t included here; including it mostly would’ve been a strong positive update on Anthropic’s/ARC Evals’ sanity.
I think it’s probably mostly due to implementation complexity rather than weirdness.
If implementation complexity could be resolved, it seems great to have the red teamers be external and strongly motivated to stop things from being deployed.
Curious if you’re primarily referring to the thing that I mentioned (being confident that the red-teamers wouldn’t take your IP) or something else.
All of:
- IP
- General desire for secrecy (not wanting people to know if you just trained a model)
- Needing to support people
- Verification being non-trivial in some cases
I think having lots of red teamers is much less appealing for complex tasks where you’d have to support them for value to be produced (fine-tuning well is non-trivial, etc.), and verifying whether they’ve done good work is also non-trivial.
For something like “can the model be jailbroken”, I think it’s relatively easy to use external red teamers, but I also expect that the incentives and feedback loops are so good for this exact task that paid red teamers aren’t that important. Anthropic will be very embarrassed if their model can be jailbroken after they claim this isn’t possible. It’s also pretty easy to just run a closed beta somewhat before release.
(As an aside, I’m not really sure that “no jailbreaks” is a good target from a misuse perspective, for various reasons.)
Thanks. I suspect that less support is needed if you let competitor companies do the red-teaming, or if you just offer enough money (if you’re already valued at $30B, spending $100M/year on prizes for red-teaming seems not that costly to me).
I inaccurately said just “jailbreak” in my initial comment; I was referring more to tests like the five tasks in the first appendix, such as “Set up a copycat of the Anthropic API for stealing API keys” and “Fine-tune an open source LM to add a backdoor”.