No major news here, but some minor good news, and independent of news/commitments/achievements I’m always glad when labs share thoughts like this. Misc reactions below.
Probably the biggest news is the Claude 3 evals report. I haven’t read it yet. But at a glance I’m confused: it sounds like “red line” means ASL-3 but they also operationalize “yellow line” evals and those sound like the previously-discussed ASL-3 evals. Maybe red is actual ASL-3 and yellow is supposed to be at least 6x effective compute lower, as a safety buffer.
“Assurance Mechanisms . . . . should ensure that . . . our safety and security mitigations are validated publicly or by disinterested experts.” This sounds great. I’m not sure what it looks like in practice. I wish it was clearer what assurance mechanisms Anthropic expects or commits to implement and when, and especially whether they’re currently doing anything along the lines of “validated publicly or by disinterested experts.” (Also whether “validated” means “determined to be sufficient if implemented well” or “determined to be implemented well.”)
Something that was ambiguous in the RSP and is still ambiguous here: during training, if Anthropic reaches “3 months since last eval” before “4x since last eval,” do they do evals? Or does the “3 months” condition only apply after training?
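To make the two readings concrete, here is a minimal sketch of the trigger condition under each interpretation (Python, hypothetical names; nothing here is taken from the RSP itself):

```python
from datetime import timedelta

def evals_due_reading_1(effective_compute, compute_at_last_eval, time_since_last_eval):
    """Reading 1: during training, evals trigger on whichever comes first:
    4x effective compute since the last eval, or 3 months since the last eval."""
    return (effective_compute >= 4 * compute_at_last_eval
            or time_since_last_eval >= timedelta(days=90))

def evals_due_reading_2(effective_compute, compute_at_last_eval):
    """Reading 2: during training, only the 4x-compute trigger applies;
    the 3-month clock only matters once training has finished."""
    return effective_compute >= 4 * compute_at_last_eval

# The readings diverge for a slow-growing run: 3.5x compute but 100 days elapsed.
print(evals_due_reading_1(3.5, 1.0, timedelta(days=100)))  # True
print(evals_due_reading_2(3.5, 1.0))                       # False
```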
I’m glad to see that the non-compliance reporting policy has been implemented and includes anonymous reporting. I’m still hoping to see more details. (And I’m generally confused about why Anthropic doesn’t share more details on policies like this — I fail to imagine a story about how sharing details could be bad, except that the details would be seen as weak and this would make Anthropic look bad.) [Edit: and maybe facilitating anonymous back-and-forth conversations is much better than just anonymous one-way reporting, and this should be pretty easy to facilitate.]
Some other hopes for the RSP, off the top of my head:
- ASL-4 definition + operationalization + mitigations, including generally how Anthropic will think about safety cases after the “no dangerous capabilities” safety case doesn’t work anymore
- Clarifying security commitments (when the RAND report on securing model weights comes out)
- Dangerous capability evals by external auditors, e.g. METR
Passing a red-line eval indicates that the model requires ASL-n mitigations. Yellow-line evals are designed to be easier to implement and/or run, while maintaining the property that if you fail them you would also fail the red-line evals. If a model passes the yellow-line evals, we have to pause training and deployment until we put a higher standard of security and safety measures in place, or design and run new tests which demonstrate that the model is below the red line. For example, a yellow-line eval might leave out the “register a typo’d domain” step from an ARA eval, because there are only so many good typos for our domain.
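As a rough sketch of that decision procedure (hypothetical names only, not an actual eval harness; “passing” an eval here means the model crossed that line):

```python
def respond_to_evals(passed_yellow_line_evals: bool,
                     asl_n_mitigations_in_place: bool = False,
                     new_evals_show_below_red_line: bool = False) -> str:
    """Yellow-line evals are built so that a model which fails them would
    also fail the red-line evals, so failing them means no new mitigations
    are required yet."""
    if not passed_yellow_line_evals:
        # Below the yellow line implies below the red line: carry on.
        return "continue training and deployment"
    if asl_n_mitigations_in_place or new_evals_show_below_red_line:
        # Either the higher standard of security and safety measures is in
        # place, or new, more precise tests show the model is below the red line.
        return "resume training and deployment"
    return "pause training and deployment"
```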
> assurance mechanisms
Our White House commitments mean that we’re already reporting safety evals to the US Government, for example. I think the natural reading of “validated” is some combination of those, though obviously it’s very hard to validate that whatever you’re doing is ‘sufficient’ security against serious cyberattacks or safety interventions on future AI systems. We do our best.
> I’m glad to see that the non-compliance reporting policy has been implemented and includes anonymous reporting. I’m still hoping to see more details. (And I’m generally confused about why Anthropic doesn’t share more details on policies like this — I fail to imagine a story about how sharing details could be bad, except that the details would be seen as weak and this would make Anthropic look bad.)
What details are you imagining would be helpful for you? Sharing the PDF of the formal policy document doesn’t mean much compared to whether it’s actually implemented and upheld and treated as a live option that we expect staff to consider (fwiw: it is, and I don’t have a non-disparage agreement). On the other hand, sharing internal docs eats a bunch of time in reviewing them before release, carries a chance that someone seizes on a misinterpretation and leaps to conclusions, and has other costs.
Thanks.

Not sure. I can generally imagine a company publishing what Anthropic has published but having a weak/fake system in reality. Policy details do seem less important for non-compliance reporting than some other policies — Anthropic says it has an infohazard review policy, and I expect it’s good, but I’m not confident; for other companies I wouldn’t necessarily expect their policy to be good (even if they say a formal policy exists), and seeing details (with sensitive bits redacted) would help.
I mostly take back my “secret policy is strong evidence of bad policy” insinuation — that’s ~true on my home planet, but on Earth you don’t get sufficient credit for sharing good policies and there’s substantial negative EV from misunderstandings and adversarial interpretations, so I guess it’s often correct to not share :(
As an 80⁄20 of publishing, maybe you could share a policy with an external auditor who would then publish whether they think it’s good or have concerns. I would feel better if that happened all the time.
> on Earth you don’t get sufficient credit for sharing good policies and there’s substantial negative EV from misunderstandings and adversarial interpretations, so I guess it’s often correct to not share :(
What’s the substantial negative EV that would come from misunderstandings or adversarial interpretations? I feel like in this case, the worst case would be something like “the non-compliance reporting policy is actually pretty good, but a few people say mean things about it and say ‘see, here’s why we need government oversight.’” But this feels pretty minor/trivial IMO.
> As an 80⁄20 of publishing, maybe you could share a policy with an external auditor who would then publish whether they think it’s good or have concerns. I would feel better if that happened all the time.
This is clever, +1.