This doesn’t seem right to me. I count 0 out of 4 of the things that the TIME op-ed asks for in terms of Anthropic’s stated policies and current actions. I agree that Anthropic does something related to each of the 4 items asked for in the TIME op-ed, but I wouldn’t say that Anthropic does most of these.
If I apply my judgment, I think Anthropic gets perhaps 1/4 to 1/2 credit on each of the items for a total of maybe 1.17/4. (1/3 credit on two items and 1/4 on another two in my judgment.)
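(For what it's worth, the tally checks out; a quick sketch with exact fractions:)

```python
from fractions import Fraction

# Per-item credit as stated above: 1/3 on two items, 1/4 on another two.
scores = [Fraction(1, 3), Fraction(1, 3), Fraction(1, 4), Fraction(1, 4)]
total = sum(scores)

print(total)                   # 7/6
print(round(float(total), 2))  # 1.17
```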
To be clear, I don’t think it is obvious that Anthropic should commit to doing these things and I probably wouldn’t urge Anthropic to make unilateral commitments on many or most of these. Edit: Also, I obviously appreciate the stuff that Anthropic is already doing here and it seems better than what other AI companies are doing.
Quick list:
Disclosure of in-development capabilities
The TIME op-ed asks that:
when a frontier lab first observes that a novel and major capability (as measured by the lab’s own safety plan) has been reached, the public should be informed
My understanding is that a lot of Dan’s goal with this is reporting “in-development” capabilities rather than deployed capabilities.
The RSP says:
We will publicly release key information related to the evaluation and deployment of our models (not including sensitive details). These include summaries of related Capability and Safeguards reports when we deploy a model.
[...]
We currently expect that if we do not deploy the model publicly and instead proceed with training or limited deployments, we will likely instead share evaluation details with a relevant U.S. Government entity.
[...]
We will notify a relevant U.S. Government entity if a model requires stronger protections than the ASL-2 Standard.
If a model isn’t deployed, then public disclosure (likely?) wouldn’t happen, and Anthropic seemingly might not disclose to the government.
From my understanding, this policy is also consistent with doing nothing even for released models except notifying the relevant USG entity if the relevant capabilities are considered “sensitive details”. And, I would also note that there isn’t any mention of timing here. (The TIME op-ed asks for disclosure when a key capability is first reached, even internally.)
Again, not necessarily claiming this is the wrong policy. Also, even if Anthropic doesn’t have an official policy, they might end up disclosing anyway.
I’d count this as roughly 1/3 credit?
Disclosure of training goal / model spec
Anthropic does disclose system prompts.
However on:
Anthropic publishes both the constitution we train with
Is this true? I wasn’t able to quickly find something that I’m confident is up to date. There is this, but it probably isn’t up to date, right? (The blog post is very old.) (The TIME article also links this old blog post.)
Additionally, I don’t think the constitution should count as a model spec. As far as I can tell, it does not spell out what intended behavior is in a variety of important situations.
(I think OpenAI’s model spec does do this.)
Again, not claiming this is that important to actually do. Also, Anthropic might do this in the future, but they don’t appear to currently be doing this nor do they have a policy of doing this.
I’d count this as 1/4 credit?
Public discussion of safety cases and potential risks
The relevant line in the RSP is again:
We will publicly release key information related to the evaluation and deployment of our models (not including sensitive details). These include summaries of related Capability and Safeguards reports when we deploy a model.
I’d say this doesn’t meet the ask in the TIME op-ed because:
It doesn’t say that it would include a full safety case and instead just includes “key information” and “summaries”. It’s unclear whether the contents would suffice for a safety case, even putting aside redaction considerations.
Again, the timing is unclear, and it isn’t clear whether a case would be publicly made for a model which isn’t deployed externally (but is used internally).
It’s possible that Anthropic would release a full (redacted) safety/risk case, but they don’t seem to have a policy of doing this.
I’d count this as 1/3 credit?
Whistleblower protections
The TIME op-ed says:
After much public debate and feedback, we believe that SB 1047’s whistleblower protections, particularly the protections for AI company employees who report on violations of the law or extreme risks, ended up reasonably close to the ideal.
The relevant policy from Anthropic is:
Noncompliance: We will maintain a process through which Anthropic staff may anonymously notify the Responsible Scaling Officer of any potential instances of noncompliance with this policy. We will also establish a policy governing noncompliance reporting, which will (1) protect reporters from retaliation and (2) set forth a mechanism for escalating reports to one or more members of the Board of Directors in cases where the report relates to conduct of the Responsible Scaling Officer. Further, we will track and investigate any reported or otherwise identified potential instances of noncompliance with this policy. Where reports are substantiated, we will take appropriate and proportional corrective action and document the same. The Responsible Scaling Officer will regularly update the Board of Directors on substantial cases of noncompliance and overall trends.
This is importantly weaker than the SB 1047 whistleblower protections because:
It relies on trusting the Responsible Scaling Officer and/or the Board of Directors. 1047 allows for disclosing to the Attorney General.
The board doesn’t currently seem very non-conflicted or independent. The LTBT will soon be able to appoint a majority of seats, but they seemingly haven’t appointed all the seats they could right now. (Just Jay Kreps based on public info.) I’d probably give more credit if the Board and Responsible Scaling Officer were more clearly independent.
It just concerns RSP non-compliance while 1047 includes protections for cases where the “employee has reasonable cause to believe the information indicates [...] that the covered model poses an unreasonable risk of critical harm”. If the employee thinks that the RSP is inadequate to prevent a critical harm, they would have no formal recourse.
I’d count this as 1/4 credit? I could be argued up to 1/3, idk.
If grading I’d give full credit for (2) on the basis of “documents like these” referring to Anthropic’s constitution + system prompt and OpenAI’s model spec, and more generous partials for the others. I have no desire to litigate details here though, so I’ll leave it at that.
We currently expect that if we do not deploy the model publicly and instead proceed with training or limited deployments, we will likely instead share evaluation details with a relevant U.S. Government entity.
Not the point in question, but I am confused why this is not “every government in the world” or at least “every government of a developed nation”. Why single out the US government for knowing about news on a potentially existentially catastrophic course of engineering?
Proceeding with training or limited deployments of a “potentially existentially catastrophic” system would clearly violate our RSP, at minimum the commitment to define and publish ASL-4-appropriate safeguards and conduct evaluations confirming that they are not yet necessary. This footnote is referring to models which pose much lower levels of risk.
And it seems unremarkable to me for a US company to ‘single out’ a relevant US government entity as the recipient of a voluntary non-public disclosure of a non-existential risk.
This footnote is referring to models which pose much lower levels of risk.
Huh? I thought all of this was the general policy going forward including for powerful systems, e.g. ASL-4. Is there some other disclosure if-then commitment which kicks in for >=ASL-3 or >=ASL-4? (I don’t see this in the RSP.)
I agree that the RSP will have to be updated for ASL-4 once ASL-3 models are encountered, but it doesn’t seem like this would necessarily add additional disclosure requirements.
Proceeding with training or limited deployments of a “potentially existentially catastrophic” system would clearly violate our RSP
Sure, but I care a lot (perhaps mostly) about worlds where Anthropic has roughly two options: an uncompetitively long pause or proceeding while imposing large existential risks on the world via invoking the “loosen the RSP” clause (e.g. 10% lifetime x-risk[1]). In such a world, the relevant disclosure policy is especially important, particularly if such a model isn’t publicly deployed (implying they seemingly only have a policy of necessarily disclosing to a relevant USG entity) and Anthropic decides to loosen the RSP.
That is, 10% “deontological” additional risk which is something like “if all other actors proceeded in the same way with the same safeguards, what would the risks be”. In practice, risks might be correlated such that you don’t actually add as much counterfactual risk on top of risks from other actors (though this will be tricky to determine).
Note that the footnote outlining how the RSP would be loosened is:
It is possible at some point in the future that another actor in the frontier AI ecosystem will pass, or be on track to imminently pass, a Capability Threshold without implementing measures equivalent to the Required Safeguards such that their actions pose a serious risk for the world. In such a scenario, because the incremental increase in risk attributable to us would be small, we might decide to lower the Required Safeguards. If we take this measure, however, we will also acknowledge the overall level of risk posed by AI systems (including ours), and will invest significantly in making a case to the U.S. government for taking regulatory action to mitigate such risk to acceptable levels.
This doesn’t clearly require a public disclosure. (What is required to “acknowledge the overall level of risk posed by AI systems”?)
(Separately, I’m quite skeptical of “In such a scenario, because the incremental increase in risk attributable to us would be small”. This appears to be assuming that Anthropic would have sufficient safeguards? (If two actors have comparable safeguards, I would still expect that only having one of these two actors would substantially reduce direct risks.) Maybe this is a mistake and this means to say “if the increase in risk was small, we might decide to lower the required safeguards”.)
Thanks Ryan, I think I basically agree with your assessment of Anthropic’s policies: partial credit compared to what I’d like.
I still think Anthropic deserves credit for going this far at least—thanks Zac!
Noob question, does anyone have a link to what document this refers to?
https://cdn.openai.com/spec/model-spec-2024-05-08.html