Clear commitments along the lines of “we promise to run these 5 specific tests to evaluate these 10 specific dangerous capabilities.”
Clear commitments regarding what happens if the evals go off (e.g., “if a model scores above a 20 on the Hubinger Deception Screener, we will stop scaling until it has scored below a 10 on the relatively conservative Smith Deception Test.”)
Clear commitments regarding the safeguards that will be used once evals go off (e.g., “if a model scores above a 20 on the Cotra Situational Awareness Screener, we will use XYZ methods and we believe they will be successful for ABC reasons.”)
Clear evidence that these evals will exist, will likely work, and will be conservative enough to prevent catastrophe
Some way of handling race dynamics (such that Bad Guy can’t just be like “haha, cute that you guys are doing RSPs. We’re either not going to engage with your silly RSPs at all, or we’re gonna publish our own RSP but it’s gonna be super watered down and vague”).
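As a toy sketch (not a proposal): one way threshold commitments like the two above could be written in machine-checkable form. The eval names, the 20/10 scores, and the “XYZ methods” are the hypothetical examples from the list; everything else in the snippet is illustrative.

```python
# Toy sketch only. Eval names, thresholds, and "XYZ methods" come from the
# hypothetical examples above; the structure around them is illustrative.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ScalingCommitment:
    screener: str                      # eval whose score can trigger the commitment
    trigger_above: float               # commitment triggers if the screener score exceeds this
    resume_test: Optional[str] = None  # more conservative eval that gates resumption, if any
    resume_below: Optional[float] = None
    required_safeguards: list = field(default_factory=list)

COMMITMENTS = [
    # "stop scaling until it has scored below a 10 on the ... Smith Deception Test"
    ScalingCommitment("Hubinger Deception Screener", trigger_above=20,
                      resume_test="Smith Deception Test", resume_below=10),
    # "we will use XYZ methods and we believe they will be successful for ABC reasons"
    ScalingCommitment("Cotra Situational Awareness Screener", trigger_above=20,
                      required_safeguards=["XYZ methods"]),
]

def may_continue_scaling(scores: dict, safeguards_in_place: set) -> bool:
    """False if any triggered commitment's resume/safeguard conditions aren't yet met."""
    for c in COMMITMENTS:
        score = scores.get(c.screener)
        if score is None:
            return False  # a required eval wasn't run: treat as a trigger, not a pass
        if score <= c.trigger_above:
            continue      # this commitment was not triggered
        if c.resume_test is not None:
            resume_score = scores.get(c.resume_test)
            if resume_score is None or resume_score >= c.resume_below:
                return False  # haven't cleared the more conservative test yet
        if not all(s in safeguards_in_place for s in c.required_safeguards):
            return False      # the promised safeguards aren't verifiably in place
    return True
```

Even this toy version makes the hard part obvious: someone still has to name the specific evals, justify the thresholds, and say what counts as the safeguards actually being in place.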
Yeah, of course this would be nice. But the reason that ARC and Anthropic didn’t write this ‘good RSP’ isn’t that they’re reckless, but because writing such an RSP is a hard open problem. It would be great to have “specific tests” for various dangerous capabilities, or “Some way of handling race dynamics,” but nobody knows what those are.
Of course the specific object-level commitments Anthropic has made so far are insufficient. (Fortunately, they committed to make more specific object-level commitments before reaching ASL-3, and ASL-3 is reasonably well-specified [edit: and almost certainly below x-catastrophe-level].) I praise Anthropic’s RSP and disagree with your vibe because I don’t think you or I or anyone else could write much better commitments. (If you have specific commitments-labs-should-make in mind, please share them!)
(Insofar as you’re just worried about comms and what-people-think-about-RSPs rather than how-good-RSPs-are, I’m agnostic.)
Thanks, Zach! Responses below:
I agree that writing a good RSP is a hard open problem. I don’t blame ARC for not having solved the “how can we scale safely” problem. I am disappointed in ARC for communicating about this poorly (in their public blog post, and [speculative/rumor-based] maybe in their private government advocacy as well).
I’m mostly worried about the comms/advocacy/policy implications. I would feel much better if Anthropic and ARC had come out and said: “Look, we have some ideas, but the field really isn’t mature enough, we really don’t know what we’re doing, and these voluntary commitments are clearly insufficient. If you really had to ask us for our best guesses about what to do if no government regulation is coming, and for some reason we had to keep racing toward god-like AI, here they are. But please note that this is woefully insufficient, and we would strongly prefer government intervention to buy enough time so that we can have actual plans.”
I also expect most of the (positive or negative) impact of the recent RSP posts to come from the comms/advocacy/policy externalities.
I don’t think the question of whether you or I could write better commitments is very relevant. My claim is more like “no one can make a good enough RSP right now, so instead of selling governments on RSPs, we should be communicating clearly that the current race to godlike AI is awful, our AI safety ideas are primitive, we might need to pause for decades, and we should start developing the hardware monitoring, risk assessment, emergency powers, kill-switch infrastructure, and international institutions that we will need.”
But if I had to answer this directly: I actually do think that if I spent 1-2 weeks coming up with better commitments, and could ask for feedback from 1-3 advisors of my choice, I could probably come up with “better” commitments. I don’t think this is because I’m particularly smart, though; I just think the bar is low. My impression is that the 22-page doc from Anthropic didn’t actually have many commitments.
The main commitments that stood out to me were: (a) run evals [exact evals unspecified] at least every 4x increase in effective compute, (b) have good infosec in place before you have models that can enable bioweapons or other Scary Misuse, and (c) define ASL-4 criteria once you have Scary Misuse models. There are some other more standard/minor commitments as well, like sharing vulnerabilities with other labs, tiered model access, and patching jailbreaks [how? and how much is sufficient?].
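For what it’s worth, commitment (a) is essentially a cadence rule; here is a minimal sketch, with the 4x factor taken from the RSP and everything else illustrative:

```python
# Minimal sketch of commitment (a): re-run the eval suite at least every 4x
# growth in effective compute. The 4x factor is the RSP's; the rest is illustrative.
EVAL_EVERY_FACTOR = 4.0

def evals_due(effective_compute_now: float, effective_compute_at_last_eval: float) -> bool:
    """True once effective compute has grown 4x or more since the last eval run."""
    return effective_compute_now >= EVAL_EVERY_FACTOR * effective_compute_at_last_eval
```

Which evals get run at that cadence, and what counts as passing them, is exactly the part left unspecified.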
The main caveat I’ll add is that “better” is a fuzzy term in this context. I’m guessing a lot of the commitments I’d come up with would be more costly from Anthropic’s POV. So maybe many of them would be “worse” in the sense that Anthropic wouldn’t be willing to adopt them, or would argue that since other labs aren’t going to adopt them, Anthropic can’t adopt them either without becoming less likely to win the race.
I would love to see competing RSPs (or, better yet, RTDPs, as @Joe_Collman pointed out in a cousin comment).