Yay @Zac Hatfield-Dodds of Anthropic for feedback and corrections, including clarifying a couple of Anthropic's policies. Two pieces of not-previously-public information:
I was disappointed that Anthropic’s Responsible Scaling Policy only mentions evaluation “During model training and fine-tuning.” Zac told me “this was a simple drafting error—our every-three months evaluation commitment is intended to continue during deployment. This has been clarified for the next version, and we’ve been acting accordingly all along.” Yay.
I said labs should have a “process for staff to escalate concerns about safety” and “have a process for staff and external stakeholders to share concerns about risk assessment policies or their implementation with the board and some other staff, including anonymously.” I noted that Anthropic’s RSP includes a commitment to “Implement a non-compliance reporting policy.” Zac told me “Beyond standard internal communications channels, our recently formalized non-compliance reporting policy meets these criteria [including independence], and will be described in the forthcoming RSP v1.1.” Yay.
I think it's cool that Zac replied (but most of my questions for Anthropic remain unanswered).
I have not yet received substantive corrections/clarifications from any other labs.
(I have made some updates to ailabwatch.org based on Zac’s feedback—and revised Anthropic’s score from 45 to 48—but have not resolved all of it.)