We currently expect that if we do not deploy the model publicly and instead proceed with training or limited deployments, we will likely instead share evaluation details with a relevant U.S. Government entity.
Not the point in question, but I am confused why this is not “every government in the world” or at least “every government of a developed nation”. Why single out the US government for knowing about news on a potentially existentially catastrophic course of engineering?
Proceeding with training or limited deployments of a “potentially existentially catastrophic” system would clearly violate our RSP, at minimum the commitment to define and publish ASL-4-appropriate safeguards and conduct evaluations confirming that they are not yet necessary. This footnote is referring to models which pose much lower levels of risk.
And it seems unremarkable to me for a US company to ‘single out’ a relevant US government entity as the recipient of a voluntary non-public disclosure of a non-existential risk.
This footnote is referring to models which pose much lower levels of risk.
Huh? I thought all of this was the general policy going forward including for powerful systems, e.g. ASL-4. Is there some other disclosure if-then commitment which kicks in for >=ASL-3 or >=ASL-4? (I don’t see this in the RSP.)
I agree that the RSP will have to be updated for ASL-4 once ASL-3 models are encountered, but it doesn’t seem like this would necessarily add additional disclosure requirements.
Proceeding with training or limited deployments of a “potentially existentially catastrophic” system would clearly violate our RSP
Sure, but I care a lot (perhaps mostly) about worlds where Anthropic has roughly two options: an uncompetitively long pause, or proceeding while imposing large existential risks on the world by invoking the “loosen the RSP” clause (e.g. 10% lifetime x-risk[1]). In such a world, the relevant disclosure policy is especially important, particularly if such a model isn’t publicly deployed (implying they seemingly only have a policy of necessarily disclosing to a relevant USG entity) and Anthropic decides to loosen the RSP.
That is, 10% “deontological” additional risk, which is something like “if all other actors proceeded in the same way with the same safeguards, what would the risks be”. In practice, risks might be correlated such that you don’t actually add as much counterfactual risk on top of risks from other actors (though this will be tricky to determine).
Note that the footnote outlining how the RSP would be loosened is:
It is possible at some point in the future that another actor in the frontier AI ecosystem will pass, or be on track to imminently pass, a Capability Threshold without implementing measures equivalent to the Required Safeguards such that their actions pose a serious risk for the world. In such a scenario, because the incremental increase in risk attributable to us would be small, we might decide to lower the Required Safeguards. If we take this measure, however, we will also acknowledge the overall level of risk posed by AI systems (including ours), and will invest significantly in making a case to the U.S. government for taking regulatory action to mitigate such risk to acceptable levels.
This doesn’t clearly require a public disclosure. (What is required to “acknowledge the overall level of risk posed by AI systems”?)
(Separately, I’m quite skeptical of “In such a scenario, because the incremental increase in risk attributable to us would be small”. This appears to assume that Anthropic would have sufficient safeguards? (If two actors have comparable safeguards, I would still expect that having only one of the two actors would substantially reduce direct risks.) Maybe this is a mistake and it means to say “if the increase in risk were small, we might decide to lower the required safeguards”.)