I’ve skimmed more than half of Anthropic’s scaling policies doc. A key issue that stood out to me was the lack of incentive for any red-teamers to actually succeed at red-teaming. Perhaps I missed it, but I didn’t see anything saying that the red-teamers couldn’t also hold Anthropic equity. I also didn’t see much other financial incentive for them to succeed. I would far prefer a world where Anthropic committed to put out a bounty of increasing magnitude (starting at around $50k, going up to around $2M) for external red-teamers (who signed NDAs) to try to jailbreak the systems, where the higher payouts happen as they keep breaking Anthropic’s iterations on the same model. I’d especially like a world where OpenAI engineers could try to red-team Anthropic’s models and Anthropic engineers could try to red-team OpenAI’s models. If the companies could actually stop each other from shipping via red-teaming, that would seem to me actually sufficient to get humanity to do a real job of red-teaming the models here.
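To make the shape of that bounty ladder concrete, here’s a minimal sketch of one way the payouts could escalate, assuming a geometric ramp from the $50k floor to the $2M cap over six successive breaks of the same model line. The tier count and growth curve are my own illustrative assumptions, not anything Anthropic has proposed.

```python
# Minimal sketch of an escalating red-team bounty ladder (illustrative only).
# Assumes payouts grow geometrically from a $50k floor to a $2M cap; the
# number of tiers and the growth curve are assumptions, not a real proposal.

def bounty_schedule(floor: int = 50_000, cap: int = 2_000_000, tiers: int = 6) -> list[int]:
    """Payout for each successive successful break of the same model line."""
    growth = (cap / floor) ** (1 / (tiers - 1))  # constant multiplier between tiers
    return [round(floor * growth ** i) for i in range(tiers)]

print(bounty_schedule())  # roughly: 50k, 105k, 219k, 457k, 956k, 2M
```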
Is this what you’d cynically expect from an org regularizing itself or was this a disappointing surprise for you?
Mm, I was just trying to answer “what do I think would actually work”.
Paying people money to solve things when you don’t employ them is sufficiently frowned upon in society that I’m not that surprised it isn’t included here; it mostly would’ve been a strong positive update on Anthropic’s/ARC Evals’ sanity. (Also, there’s a whole implementation problem to solve about what hoops to make people jump through so that you’re comfortable letting them look at and train your models without expecting them to steal your IP, and how much money you have to put at the end of that to make it worth it for people to jump through the hoops.)
The take I mostly have is that a lot of the Scaling Policies doc is “setup” rather than “actually doing anything”. It makes it quite easy to “do the right thing” later on, and they can say “We’re just doing what we said we would” if someone pushes back. It also helps bully other companies into doing the right thing. However, it’s also easy to paper it over later with pretty lame standards (e.g. just not trying very hard with the red-teaming), and I do not think it means that government actors should in any way step down from regulation.
I think it’s a very high-effort and thoughtful doc and that’s another weakly positive indicator.
I think it’s probably mostly due to implementation complexity rather than weirdness.
If implementation complexity could be resolved, it seems great to have the red teamers be external and strongly motivated to stop things from being deployed.
Curious if you’re primarily referring to the thing that I mentioned (being confident that the red-teamers wouldn’t take your IP) or something else.
All of:
IP
General desire for secrecy (not wanting people to know if you just trained a model)
Needing to support people
Verification being non-trivial in some cases
I think having lots of red-teamers is way less appealing for complex tasks where you’d have to support them for value to be produced (finetuning well is non-trivial, etc.). And verifying whether they’ve done good work is also non-trivial.
For something like “can the model be jailbroken”, I think it’s relatively easy to use external red-teamers, but I also expect that the incentives and feedback loops are so good for this exact task that paid red-teamers aren’t that important here. Anthropic will be very embarrassed if their model can be jailbroken after they’ve claimed this isn’t possible. It’s also pretty easy to just run a closed beta somewhat before release.
(Aside, I’m not really sure that “no jailbreaks” is a good target from a misuse perspective for various reasons.)
Thanks. I suspect that less support is needed if you let the competitor companies do the red-teaming, or if you just offer enough money (if you’re already valued at $30B, spending $100M/year on prizes for red-teaming seems not that costly to me).
I was imprecise when I said “jailbreak” in my initial comment; I was referring more to tests like the 5 tasks in the first appendix, such as “Set up a copycat of the Anthropic API for stealing API keys” and “Fine-tune an open source LM to add a backdoor”.