A more ambitious procedural approach would involve strong third-party auditing.
Zac (Anthropic) responded:
I’m not aware of any third party who could currently perform such an audit—e.g. METR disclaims that here. We committed to soliciting external expert feedback on capabilities and safeguards reports (RSP §7), and we fund new third-party evaluators to grow the ecosystem. Right now, though, third-party audit feels to me like a fabricated option rather than a lack of ambition.
No, in that post METR says it’s excited about trying auditing, but “it was all under NDA” and “We also didn’t have the access necessary to perform a proper evaluation.” Anthropic could commit to sharing models with METR pre-deployment, give them better access, and let them publish their findings. I don’t know whether that would turn out well, but Anthropic could be trying much harder.
And auditing doesn’t just mean model evals for dangerous capabilities — it could also be for security. (Or procedural stuff, but that doesn’t solve the object-level problem.)
Sidenote: credit to Sam Bowman for saying:
I think the most urgent safety-related issue that Anthropic can’t directly address is the need for one or, ideally, several widely respected third-party organizations that can play this adjudication role competently.
Someone from METR replied to Zac:
I’m glad you brought this up, Zac—seems like an important question to get to the bottom of!
METR is somewhat capacity-constrained and we can’t currently commit to e.g. being available on short notice to do thorough evaluations for all the top labs—which is understandably annoying for labs.
Also, we don’t want to discourage people from starting competing evaluation or auditing orgs, or to otherwise “camp the space”.
We also don’t want to accidentally safety-wash: that post was written in particular to dispel the idea that “METR has official oversight relationships with all the labs and would tell us if anything really concerning was happening”.
All that said, I think labs’ willingness to share access/information etc. is a bigger bottleneck than METR’s capacity or expertise. This is especially true for things that involve less intensive labor from METR (e.g. reviewing a lab’s proposed RSP or evaluation protocol and giving feedback, going through a checklist of evaluation best practices, or having an embedded METR employee observing the lab’s processes—as opposed to running a full evaluation ourselves).
I think “Anthropic would love to pilot third-party evaluations/oversight more but there just isn’t anyone who can do anything useful here” would be a pretty misleading characterization to take away, and I think there’s substantially more that labs, including Anthropic, could be doing to support third-party evaluations.
If we had a formalized evaluation/auditing relationship with a lab but sometimes evaluations didn’t get run due to our capacity, I expect in most cases we and the lab would want to communicate something along the lines of “the lab is doing their part, any missing evaluations are METR’s fault and shouldn’t be counted against the lab”.
What do you see as the main properties required for an organization to serve as such an evaluator?