No, in that post METR says it’s excited about trying auditing, but “it was all under NDA” and “We also didn’t have the access necessary to perform a proper evaluation.” Anthropic could commit to sharing models with METR pre-deployment, give them better access, and let them publish their findings. I don’t know if that would turn out well, but Anthropic could be trying much harder.
And auditing doesn’t just mean model evals for dangerous capabilities — it could also be for security. (Or procedural stuff, but that doesn’t solve the object-level problem.)
Sidenote: credit to Sam Bowman for saying: “I think the most urgent safety-related issue that Anthropic can’t directly address is the need for one or, ideally, several widely respected third-party organizations that can play this adjudication role competently.”
I’m glad you brought this up, Zac—seems like an important question to get to the bottom of!
METR is somewhat capacity-constrained and we can’t currently commit to, e.g., being available on short notice to do thorough evaluations for all the top labs—which is understandably annoying for labs.
Also, we don’t want to discourage people from starting competing evaluation or auditing orgs, or otherwise “camp the space”.
We also don’t want to accidentally safety-wash: that post was written in particular to dispel the idea that “METR has official oversight relationships with all the labs and would tell us if anything really concerning was happening.”
All that said, I think labs’ willingness to share access, information, etc. is a bigger bottleneck than METR’s capacity or expertise. This is especially true for things that involve less intensive labor from METR (e.g. reviewing a lab’s proposed RSP or evaluation protocol and giving feedback, going through a checklist of evaluation best practices, or having an embedded METR employee observe the lab’s processes—as opposed to running a full evaluation ourselves).
I think “Anthropic would love to pilot third-party evaluations/oversight more but there just isn’t anyone who can do anything useful here” would be a pretty misleading characterization to take away, and I think there’s substantially more that labs, including Anthropic, could be doing to support third-party evaluations.
If we had a formalized evaluation/auditing relationship with a lab but sometimes evaluations didn’t get run due to our capacity, I expect in most cases we and the lab would want to communicate something along the lines of “the lab is doing their part, any missing evaluations are METR’s fault and shouldn’t be counted against the lab”.