Passing a red-line eval (i.e., the model demonstrates the capability of concern) indicates that the model requires ASL-n mitigations. Yellow-line evals are designed to be easier to implement and/or run, while maintaining the property that if you fail them you would also fail the red-line evals. If a model passes the yellow-line evals, we have to pause training and deployment until we put a higher standard of security and safety measures in place, or design and run new tests which demonstrate that the model is below the red line. For example, a yellow-line ARA eval might leave out the “register a typo’d domain” step, because there are only so many good typos for our domain.
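(To make the pass/fail gating concrete, here is a minimal illustrative sketch of the decision rule described above; the names and structure are hypothetical, not actual evaluation tooling.)

```python
# Illustrative sketch only: hypothetical names, not real evaluation tooling.
from dataclasses import dataclass


@dataclass
class EvalResult:
    name: str
    model_succeeded: bool  # "passing" = the model completed the task of concern


def must_pause(yellow_line_results: list[EvalResult]) -> bool:
    """Yellow-line evals are easier than their red-line counterparts, so a model
    that fails (cannot do) every yellow-line task would also fail the red-line
    tasks and is below the red line. If it passes any yellow-line eval, we can
    no longer rule out the red-line capability, so we pause until stronger
    safeguards are in place or new tests show the model is still below the line."""
    return any(r.model_succeeded for r in yellow_line_results)


results = [
    EvalResult("ARA eval without the domain-registration step", model_succeeded=False),
    EvalResult("some other simplified capability task", model_succeeded=False),
]
print(must_pause(results))  # False: all yellow-line evals failed, so no pause needed
```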
assurance mechanisms
Our White House commitments mean that we’re already reporting safety evals to the US Government, for example. I think the natural reading of “validated” is some combination of those, though obviously it’s very hard to validate that whatever you’re doing is ‘sufficient’ security against serious cyberattacks or safety interventions on future AI systems. We do our best.
I’m glad to see that the non-compliance reporting policy has been implemented and includes anonymous reporting. I’m still hoping to see more details. (And I’m generally confused about why Anthropic doesn’t share more details on policies like this — I fail to imagine a story about how sharing details could be bad, except that the details would be seen as weak and this would make Anthropic look bad.)
What details are you imagining would be helpful for you? Sharing the PDF of the formal policy document doesn’t mean much compared to whether it’s actually implemented and upheld and treated as a live option that we expect staff to consider (fwiw: it is, and I don’t have a non-disparage agreement). On the other hand, sharing internal docs eats a bunch of time in review before release, creates a chance that someone seizes on a misinterpretation and leaps to conclusions, and has other costs.
Not sure. I can generally imagine a company publishing what Anthropic has published but having a weak/fake system in reality. Policy details do seem less important for non-compliance reporting than for some other policies. Anthropic says it has an infohazard review policy, and I expect it’s good, but I’m not confident; for other companies I wouldn’t necessarily expect their policy to be good (even if they say a formal policy exists), and seeing details (with sensitive bits redacted) would help.
I mostly take back my “secret policy is strong evidence of bad policy” insinuation — that’s ~true on my home planet, but on Earth you don’t get sufficient credit for sharing good policies and there’s substantial negative EV from misunderstandings and adversarial interpretations, so I guess it’s often correct to not share :(
As an 80/20 of publishing, maybe you could share a policy with an external auditor who would then publish whether they think it’s good or have concerns. I would feel better if that happened all the time.
on Earth you don’t get sufficient credit for sharing good policies and there’s substantial negative EV from misunderstandings and adversarial interpretations, so I guess it’s often correct to not share :(
What’s the substantial negative EV that would come from misunderstandings or adversarial interpretations? I feel like in this case, the worst case would be like “the non-compliance reporting policy is actually pretty good, but a few people say mean things about it and say ‘see, here’s why we need government oversight.’” But this feels pretty minor/trivial IMO.
Thanks.
This is clever, +1.