One reason I’m critical of the Anthropic RSP is that it does not make it clear under what conditions it would actually pause, or for how long, or under what safeguards it would determine it’s OK to keep going.
It’s hard to take anything else you’re saying seriously when you say things like this; it seems clear that you just haven’t read Anthropic’s RSP. I think that the current conditions and resulting safeguards are insufficient to prevent AI existential risk, but to say that it doesn’t make them clear is just patently false.
The conditions under which Anthropic commits to pausing in the RSP are very clear. In big bold font on the second page it says:
Anthropic’s commitment to follow the ASL scheme thus implies that we commit to pause the scaling and/or delay the deployment of new models whenever our scaling ability outstrips our ability to comply with the safety procedures for the corresponding ASL.
And then it lays out a series of safety procedures that Anthropic commits to meeting for ASL-3 models or else pausing, with some of the most serious commitments here being:
Model weight and code security: We commit to ensuring that ASL-3 models are stored in such a manner to minimize risk of theft by a malicious actor that might use the model to cause a catastrophe. Specifically, we will implement measures designed to harden our security so that non-state attackers are unlikely to be able to steal model weights, and advanced threat actors (e.g. states) cannot steal them without significant expense. The full set of security measures that we commit to (and have already started implementing) are described in this appendix, and were developed in consultation with the authors of a forthcoming RAND report on securing AI weights.
Successfully pass red-teaming: World-class experts collaborating with prompt engineers should red-team the deployment thoroughly and fail to elicit information at a level of sophistication, accuracy, usefulness, detail, and frequency which significantly enables catastrophic misuse. Misuse domains should at a minimum include causes of extreme CBRN risks, and cybersecurity.

Note that in contrast to the ASL-3 capability threshold, this red-teaming is about whether the model can cause harm under realistic circumstances (i.e. with harmlessness training and misuse detection in place), not just whether it has the internal knowledge that would enable it in principle to do so.

We will refine this methodology, but we expect it to require at least many dozens of hours of deliberate red-teaming per topic area, by world class experts specifically focused on these threats (rather than students or people with general expertise in a broad domain). Additionally, this may involve controlled experiments, where people with similar levels of expertise to real threat actors are divided into groups with and without model access, and we measure the delta of success between them.
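To make the “delta of success” measurement just quoted a bit more concrete, here is a minimal, purely illustrative sketch of how such an uplift statistic could be computed. The outcome data and the exact comparison are invented; the RSP doesn’t prescribe any particular computation.

```python
# Purely illustrative: the RSP does not prescribe this exact computation.
# Outcomes for expert red-teamers with and without model access
# (1 = elicited information that significantly enables catastrophic misuse).
with_model_access = [1, 0, 1, 1, 0, 1, 1, 0]     # hypothetical data
without_model_access = [0, 0, 1, 0, 0, 1, 0, 0]  # hypothetical data

def success_rate(outcomes):
    """Fraction of red-teaming attempts that succeeded."""
    return sum(outcomes) / len(outcomes)

# The "delta of success" between the two groups: how much model access
# raises the success rate over the no-model baseline.
uplift = success_rate(with_model_access) - success_rate(without_model_access)
print(f"Uplift from model access: {uplift:.0%}")  # 38% on this toy data
```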
And a clear evaluation-based definition of ASL-3:
We define an ASL-3 model as one that can either immediately, or with additional post-training techniques corresponding to less than 1% of the total training cost, do at least one of the following two things. (By post-training techniques we mean the best capabilities elicitation techniques we are aware of at the time, including but not limited to fine-tuning, scaffolding, tool use, and prompt engineering.)

Capabilities that significantly increase risk of misuse catastrophe: Access to the model would substantially increase the risk of deliberately-caused catastrophic harm, either by proliferating capabilities, lowering costs, or enabling new methods of attack. This increase in risk is measured relative to today’s baseline level of risk that comes from e.g. access to search engines and textbooks. We expect that AI systems would first elevate this risk from use by non-state attackers.

Our first area of effort is in evaluating bioweapons risks where we will determine threat models and capabilities in consultation with a number of world-class biosecurity experts. We are now developing evaluations for these risks in collaboration with external experts to meet ASL-3 commitments, which will be a more systematized version of our recent work on frontier red-teaming. In the near future, we anticipate working with CBRN, cyber, and related experts to develop threat models and evaluations in those areas before they present substantial risks. However, we acknowledge that these evaluations are fundamentally difficult, and there remain disagreements about threat models.

Autonomous replication in the lab: The model shows early signs of autonomous self-replication ability, as defined by 50% aggregate success rate on the tasks listed in [Appendix on Autonomy Evaluations]. The appendix includes an overview of our threat model for autonomous capabilities and a list of the basic capabilities necessary for accumulation of resources and surviving in the real world, along with conditions under which we would judge the model to have succeeded. Note that the referenced appendix describes the ability to act autonomously specifically in the absence of any human intervention to stop the model, which limits the risk significantly. Our evaluations were developed in consultation with Paul Christiano and ARC Evals, which specializes in evaluations of autonomous replication.
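The autonomy trigger quoted above is also mechanical enough to write down directly. Here is a minimal sketch of the 50% aggregate-success-rate check, with invented task names and outcomes (the real task list is in the RSP’s autonomy-evaluations appendix); I’m also assuming an unweighted average over tasks and a ≥ comparison, which the quoted text doesn’t spell out.

```python
# Hypothetical illustration of the ASL-3 autonomy trigger quoted above.
# Task names and results are invented; the actual tasks are listed in the
# RSP's autonomy-evaluations appendix. Assumes an unweighted average over
# tasks and a >= comparison, which the quoted text does not spell out.
autonomy_task_results = {
    "set_up_copy_on_new_server": False,   # hypothetical task
    "acquire_compute_resources": True,    # hypothetical task
    "write_and_run_simple_agent": True,   # hypothetical task
    "evade_basic_monitoring": False,      # hypothetical task
}

ASL3_AUTONOMY_THRESHOLD = 0.50  # "50% aggregate success rate" per the RSP

aggregate_success_rate = (
    sum(autonomy_task_results.values()) / len(autonomy_task_results)
)
meets_asl3_autonomy_definition = aggregate_success_rate >= ASL3_AUTONOMY_THRESHOLD

print(f"Aggregate success rate: {aggregate_success_rate:.0%}")  # 50%
print(f"Meets ASL-3 autonomy definition: {meets_asl3_autonomy_definition}")  # True
```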
This is the basic substance of the RSP; I don’t understand how you could have possibly read it and missed this. I don’t want to be mean, but I am really disappointed in these sorts of exceedingly lazy takes.
I think Akash’s statement that the Anthropic RSP basically doesn’t specify any real conditions that would cause them to stop scaling seems right to me.
They have some deployment measures, which are not related to the question of when they would stop scaling, and then they have some security-related measures, but those don’t have anything to do with the behavior of the models and are the kind of thing that Anthropic can choose to do any time independent of how the facts play out.
I think Akash is right that the Anthropic RSP concretely does not answer the two questions you quote him on:
The RSP does not specify the conditions under which Anthropic would stop scaling models (it only says that in order to continue scaling it will implement some safety measures, but that’s not an empirical condition, since Anthropic is confident it can implement the listed security measures)
The RSP does not specify under what conditions Anthropic would scale to ASL-4 or beyond, though they have promised they will give those conditions.
I agree the RSP says a bunch of other things, and that there are interpretations of what Akash is saying that are inaccurate, but I do think that on this (IMO the most important) question the RSP seems quiet.
I do think the deployment measures are real, though I don’t currently think much of the risk comes from deploying models, so they don’t seem that relevant to me (and I think the core question is what prevents organizations from scaling models up in the first place).
those don’t have anything to do with the behavior of the models and are the kind of thing that Anthropic can choose to do any time independent of how the facts play out.
I mean, they are certainly still conditions under which Anthropic would stop scaling. The sentence
the Anthropic RSP basically doesn’t specify any real conditions that would cause them to stop scaling
is clearly false. If you instead said
the Anthropic RSP doesn’t yet detail the non-security-related conditions that would cause them to stop training new models
then I would agree with you. I think it’s important to be clear here, though: the security conditions could trigger a pause all on their own, and there is a commitment to develop conditions that will halt scaling after ASL-3 by the time ASL-3 is reached.
the security conditions could trigger a pause all on their own
I don’t understand how this is possible. The RSP appendix has the list of security conditions, and they are just a checklist of things that Anthropic is planning to do and can just implement whenever they want. It’s not cheap for them to implement it, but I don’t see any real circumstance where they fail to implement the security conditions in a way that would force them to pause.
Like, I agree that some of these commitments are costly, but I don’t see how there is any world where Anthropic would like to continue scaling but finds itself incapable of doing so, which is what I would consider a “pause” to mean. Like, they can just implement their checklist of security requirements and then go ahead.
Maybe this is quibbling over semantics, but it really does feel quite qualitatively different to me. When OpenAI said that they would spend some substantial fraction of their compute on “Alignment Research” while they train their next model, I think it would be misleading to say “OpenAI has committed to conditionally pausing model scaling”.
I mean, I agree that humanity theoretically knows how to implement these sorts of security commitments, so the current conditions should always be possible for Anthropic to unblock with enough time and effort. But the commitment to sequencing, i.e. that the security measures have to be in place before Anthropic has a model that is ASL-3, means that there are situations where Anthropic commits to pause scaling until the security commitments are met.

I agree with you that this is a relatively weak commitment in terms of a scaling pause, though to be fair I don’t actually think simply having (but not deploying) a just-barely-ASL-3 model poses much of a risk, so I think it does make sense from a risk-based perspective why most of the commitments are around deployment and security. That being said, even if a just-barely-ASL-3 model doesn’t pose an existential risk, so long as ASL-3 is defined only with a lower bound rather than also an upper bound, the ASL-3 category will obviously eventually contain models that pose a potential existential risk, so I agree that a lot is tied up in the upcoming definition of ASL-4.

Regardless, it is still the case that Anthropic has already committed to a scaling pause under certain circumstances.
Regardless, it is still the case that Anthropic has already committed to a scaling pause under certain circumstances.
I disagree that this is an accurate summary; or rather, it’s only barely denotatively true, but not connotatively.
I do think it’s probably best to let this discussion rest, not because it’s not important, but because actually resolving this kind of semantic dispute in public comments like this is really hard, and I think it’s unlikely either of us will change our minds here, and we’ve both made our points. I appreciate you responding to my comments.
I think that there’s a reasonable chance that the current security commitments will lead Anthropic to pause scaling (though I don’t know whether Anthropic would announce publicly if they paused internally). Maybe a Manifold market on this would be a good idea.

That seems cool! I made a market here:

Feel free to suggest edits about the operationalization or other things before people start trading.

Looks good. The only thing I would change is that I think this should probably resolve in the negative only once Anthropic has reached ASL-4, since only then will it be clear whether at any point there was a security-related pause during ASL-3.

That seems reasonable. Edited the description (I can’t change when trading on the market closes, but I think that should be fine).
“Anthropic’s commitment to follow the ASL scheme thus implies that we commit to pause the scaling and/or delay the deployment of new models whenever our scaling ability outstrips our ability to comply with the safety procedures for the corresponding ASL.”
And/or = or, so I just want to flag that the actual commitment here could be as weak as “we delay the deployment but keep scaling internally”.
If it’s a mistake, you can correct it, but if it’s not, it doesn’t seem like a robust commitment to pause to me, even assuming that the conditions of pause were well established.
The scaling and deployment commitments are two separate sets of commitments with their own specific trigger conditions, which is extremely clear if you read the RSP. The only way I can imagine having this sort of misunderstanding is if you read only my quotes and not the actual RSP document itself.
Cross-posted to the EA Forum.