Daniel Kokotajlo comments on Daniel Kokotajlo’s Shortform

Daniel Kokotajlo 15 Oct 2024 18:06 UTC
123 points
25
Dean Ball is, among other things, a prominent critic of SB-1047. I meanwhile publicly supported it. But we both talked and it turns out we have a lot of common ground, especially re: the importance of transparency in frontier AI development. So we coauthored this op-ed in TIME: 4 Ways to Advance Transparency in Frontier AI Development. (tweet thread summary here)
- Zac Hatfield-Dodds 16 Oct 2024 0:42 UTC
  32 points
  −28
  Parent
  Thanks Daniel (and Dean) - I’m always glad to hear about people exploring common ground, and the specific proposals sound good to me too.
  
  I think Anthropic already does most of these, as of our RSP update this morning! While I personally support regulation to make such actions universal and binding, I’m glad that we have voluntary commitments in the meantime:
  1. Disclosure of in-development capabilities—in section 7.2 (Transparency and External Input) of our updated RSP, we commit to public disclosures for deployed models, and to notify a relevant U.S. Government entity if any model requires stronger protections than the ASL-2 Standard. I think this is a reasonable balance for a unilateral commitment.
  2. Disclosure of training goal / model spec—as you note, Anthropic publishes both the constitution we train with and our system prompts. I’d be interested in also exploring model-spec-style aspirational documents too.
  3. Public discussion of safety cases and potential risks—there’s some discussion in our Core Views essay and RSP; our capability reports and plans for safeguards and future evaluations are published here starting today (with some redactions for e.g. misuse risks).
  4. Whistleblower protections—RSP section 7.1.5 lays out our noncompliance reporting policy, and 7.1.6 a commitment not to use non-disparagement agreements which could impede or discourage publicly raising safety concerns.
  - ryan_greenblatt 16 Oct 2024 1:41 UTC
    56 points
    27
    Parent
    I think Anthropic already does most of these
    
    This doesn’t seem right to me. I count 0 out of 4 of the things that the TIME op-ed asks for in terms of Anthropic’s stated policies and current actions. I agree that Anthropic does something related to each of the 4 items asked for in the TIME op-ed, but I wouldn’t say that Anthropic does most of these.
    
    If I apply my judgment, I think Anthropic gets perhaps ¹⁄₄ to ¹⁄₂ credit on each of the items for a total of maybe 1.17/4. (1/3 credit on two items and ¹⁄₄ on another two in my judgment.)
    
    To be clear, I don’t think it is obvious that Anthropic should commit to doing these things and I probably wouldn’t urge Anthropic to make unilateral commitments on many or most of these. Edit: Also, I obviously appreciate the stuff that Anthropic is already doing here and it seems better than other AI companies.
    
    Quick list:
    
    Disclosure of in-development capabilities
    
    The TIME op-ed asks that:
    
    when a frontier lab first observes that a novel and major capability (as measured by the lab’s own safety plan) has been reached, the public should be informed
    
    My understanding is that a lot of Dan’s goal with this is reporting “in-development” capabilites rather than deployed capabilites.
    
    The RSP says:
    
    We will publicly release key information related to the evaluation and deployment of our models (not including sensitive details). These include summaries of related Capability and Safeguards reports when we deploy a model.
    
    [...]
    
    We currently expect that if we do not deploy the model publicly and instead proceed with training or limited deployments, we will likely instead share evaluation details with a relevant U.S. Government entity.
    
    [...]
    
    We will notify a relevant U.S. Government entity if a model requires stronger protections than the ASL-2 Standard.
    
    If a model isn’t deployed, then public disclosure (likely?) wouldn’t happen, and Anthropic seemingly might not disclose to the government.
    
    From my understanding, this policy is also consistent with doing nothing even for released models except notifying the relevant USG entity if the relevant capabilities are considered “sensitive details”. And, I would also note that there isn’t any mention of timing here. (The TIME op-ed asks for disclosure when a key capability is first reached, even internally.)
    
    Again, not necessarily claiming this is the wrong policy. Also, even if Anthropic doesn’t have an official policy, they might end up disclosing anyway.
    
    I’d count this as roughly 1/3rd credit?
    
    Disclosure of training goal / model spec
    
    Anthropic does disclose system prompts.
    
    However on:
    
    Anthropic publishes both the constitution we train with
    
    Is this true? I wasn’t able to quickly find something that I’m confident is up to date. There is this, but it probably isn’t up to date right? (The blog post is very old.) (The time article also links this old blog post.)
    
    Additionally, I don’t think the constitution should count as a model spec. As far as I can tell, it does not spell out what intended behavior is in a variety of important situations.
    
    (I think OpenAI’s model spec does do this.)
    
    Again, not claiming this is that important to actually do. Also, Anthropic might do this in the future, but they don’t appear to currently be doing this nor do they have a policy of doing this.
    
    I’d count this as 1/4th credit?
    
    Public discussion of safety cases and potential risks
    
    The relevant line in the RSP is again:
    
    We will publicly release key information related to the evaluation and deployment of our models (not including sensitive details). These include summaries of related Capability and Safeguards reports when we deploy a model.
    
    I’d say this doesn’t meet the ask in the TIME op-ed because:
    
    It doesn’t say that it would include a full safety case and instead just includes “key information” and “summaries”. It’s unclear whether the contents would suffice for a safety case, even putting aside redaction considerations.
    Again, the timing is unclear and isn’t clear if a case would be publicly made for a model which isn’t deployed externally (but is used internally).
    
    It’s possible that Anthropic would release a full (redacted) safety/risk case, but they don’t seem to have a policy of doing this.
    
    I’d count this as ¹⁄₃ credit?
    
    Whistleblower protections
    
    The TIME op-ed says:
    
    After much public debate and feedback, we believe that SB 1047’s whistleblower protections, particularly the protections for AI company employees who report on violations of the law or extreme risks, ended up reasonably close to the ideal.
    
    The relevant policy from Anthropic is:
    
    Noncompliance: We will maintain a process through which Anthropic staff may anonymously notify the Responsible Scaling Officer of any potential instances of noncompliance with this policy. We will also establish a policy governing noncompliance reporting, which will (1) protect reporters from retaliation and (2) set forth a mechanism for escalating reports to one or more members of the Board of Directors in cases where the report relates to conduct of the Responsible Scaling Officer. Further, we will track and investigate any reported or otherwise identified potential instances of noncompliance with this policy. Where reports are substantiated, we will take appropriate and proportional corrective action and document the same. The Responsible Scaling Officer will regularly update the Board of Directors on substantial cases of noncompliance and overall trends.
    
    This is importantly weaker than the SB-1047 whistleblower protections because:
    
    It relies on trusting the Responsible Scaling Officer and/or the Board of Directors. 1047 allows for disclosing to the Attorney General.
    The board doesn’t currently seem very non-conflicted or independent. The LTBT will soon be able to appoint a majority of seats, but they seemingly haven’t appointed all the seats they could do right now. (Just Jay Kreps based on public info.) I’d probably give more credit if the Board and Responsible Scaling Officer were more clearly independent.
    
    It just concerns RSP non-compliance while 1047 includes protections for cases where the “employee has reasonable cause to believe the information indicates [...] that the covered model poses an unreasonable risk of critical harm”. If the employee thinks that the RSP is inadequate to prevent a critical harm, they would have no formal recourse.
    
    I’d count this as ¹⁄₄ credit? I could be argued up to ¹⁄₃, idk.
    - Zac Hatfield-Dodds 16 Oct 2024 18:43 UTC
      10 points
      0
      Parent
      If grading I’d give full credit for (2) on the basis of “documents like these” referring to Anthopic’s constitution + system prompt and OpenAI’s model spec, and more generous partials for the others. I have no desire to litigate details here though, so I’ll leave it at that.
    - Daniel Kokotajlo 16 Oct 2024 2:08 UTC
      10 points
      5
      Parent
      Thanks Ryan, I think I basically agree with your assessment of Anthropic’s policies partial credit compared to what I’d like.
      
      I still think Anthropic deserves credit for going this far at least—thanks Zac!
    - Ben Pace 16 Oct 2024 6:24 UTC
      4 points
      2
      Parent
      We currently expect that if we do not deploy the model publicly and instead proceed with training or limited deployments, we will likely instead share evaluation details with a relevant U.S. Government entity.
      Not the point in question, but I am confused why this is not “every government in the world” or at least “every government of a developed nation”. Why single out the US government for knowing about news on a potentially existentially catastrophic course of engineering?
      - Zac Hatfield-Dodds 16 Oct 2024 18:33 UTC
        12 points
        1
        Parent
        Proceeding with training or limited deployments of a “potentially existentially catastrophic” system would clearly violate our RSP, at minimum the commitment to define and publish ASL-4-appropriate safeguards and conduct evaluations confirming that they are not yet necessary. This footnote is referring to models which pose much lower levels of risk.
        
        And it seems unremarkable to me for a US company to ‘single out’ a relevant US government entity as the recipient of a voluntary non-public disclosure of a non-existential risk.
        ryan_greenblatt 17 Oct 2024 0:58 UTC
        5 points
        3
        Parent
        
        This footnote is referring to models which pose much lower levels of risk.
        
        Huh? I thought all of this was the general policy going forward including for powerful systems, e.g. ASL-4. Is there some other disclosure if-then commitment which kicks in for >=ASL-3 or >=ASL-4? (I don’t see this in the RSP.)
        
        I agree that the RSP will have to be updated for ASL-4 once ASL-3 models are encountered, but this doesn’t seem like this would necessarily add additional disclosure requirements.
        
        Proceeding with training or limited deployments of a “potentially existentially catastrophic” system would clearly violate our RSP
        
        Sure, but I care a lot (perhaps mostly) about worlds where Anthropic has roughly two options: an uncompetitively long pause or proceeding while imposing large existential risks on the world via invoking the “loosen the RSP” clause (e.g. 10% lifetime x-risk^[1]). In such world, the relevant disclosure policy is especially important, particularly if such a model isn’t publicly deployed (implying they seeming only have a policy of necessarily disclosing to a relevant USG entity) and Anthropic decides to loosen the RSP.
        
        ↩︎
        That is, 10% “deontological” additional risk which is something like “if all other actors proceeded in the same way with the same safeguards, what would the risks be”. In practice, risks might be correlated such that you don’t actually add as much counterfactual risk on top of risks from other actors (though this will be tricky to determine).
        
        ryan_greenblatt 17 Oct 2024 1:05 UTC
        7 points
        8
        Parent
        Note that the footnote outlining how the RSP would be loosened is:
        
        It is possible at some point in the future that another actor in the frontier AI ecosystem will pass, or be on track to imminently pass, a Capability Threshold without implementing measures equivalent to the Required Safeguards such that their actions pose a serious risk for the world. In such a scenario, because the incremental increase in risk attributable to us would be small, we might decide to lower the Required Safeguards. If we take this measure, however, we will also acknowledge the overall level of risk posed by AI systems (including ours), and will invest significantly in making a case to the U.S. government for taking regulatory action to mitigate such risk to acceptable levels.
        
        This doesn’t clearly require a public disclosure. (What is required to “acknowledge the overall level of risk posed by AI systems”?)
        
        (Separately, I’m quite skeptical of “In such a scenario, because the incremental increase in risk attributable to us would be small”. This appears to be assuming that Anthropic would have sufficient safeguards? (If two actors have comparable safeguards, I would still expect that only having one of these two actors would substantially reduce direct risks.) Maybe this is a mistake and this means to say “if the increase in risk was small, we might decide to lower the required safeguards”.)
    - Ben Pace 16 Oct 2024 6:25 UTC
      2 points
      0
      Parent
      I think OpenAI’s model spec does do this.
      Noob question, does anyone have a link to what document this refers to?
      - Lukas Finnveden 16 Oct 2024 15:51 UTC
        4 points
        0
        Parent
        https://cdn.openai.com/spec/model-spec-2024-05-08.html
  - ryan_greenblatt 16 Oct 2024 17:31 UTC
    17 points
    −9
    Parent
    People seem to be putting a lot of disagree votes on Zac’s comment. I think this is likely in response to my comment addressing “Anthropic already does most of these”. FWIW, I just disagree with this line (and I didn’t disagree vote with the comment overall)^[1]. So, if other people are coming from a similar place as me, it seems a bit sad to pile on disagree votes and I worry about some sort of unnecessarily hostile dynamic here (and I feel like I’ve seen something similar in other places with Zac’s comments).
    
    (I do feel like the main thrust of the comment is an implication like “Anthropic is basically doing these”, which doesn’t seem right to me, but still.)
    
    ↩︎
    I might also disagree with “I think this is a reasonable balance for a unilateral commitment.”, but I certainly don’t have a strong view at the moment here.
    - yams 17 Oct 2024 6:45 UTC
      22 points
      1
      Parent
      I want to double down on this:
      Zac is consistently generous with his time, even when dealing with people who are openly hostile toward him. Of all lab employees, Zac is among the most available for—and eager to engage in—dialogue. He has furnished me personally with >2 dozen hours of extremely informative conversation, even though our views differ significantly (and he has ~no instrumental reason for talking to me in particular, since I am but a humble moisture farmer). I’ve watched him do the same with countless others at various events.
      
      I’ve also watched people yell at him more than once. He kinda shrugged, reframed the topic, and politely continued engaging with the person yelling at him. He has leagues more patience and decorum than is common among the general population. Moreover, in our quarrelsome pocket dimension, he’s part of a mere handful of people with these traits.
      
      I understand distrust of labs (and feel it myself!), but let’s not kill the messenger, lest we run out of messengers.
      - habryka 17 Oct 2024 7:04 UTC
        35 points
        37
        Parent
        Ok, but, that’s what we have the whole agreement/approval distinction for.
        
        I absolutely do not want people to hesitate to disagree vote on something because they are worried that this will be taken as disapproval or social punishment, that’s the whole reason we have two different dimensions! (And it doesn’t look like Zac’s comments are at any risk of ending up with a low approval voting score)
        yams 17 Oct 2024 7:54 UTC
        9 points
        5
        Parent
        I think a non-zero number of those disagree votes would not have appeared if the same comment were made by someone other than an Anthropic employee, based on seeing how Zac is sometimes treated IRL. My comment is aimed most directly at the people who cast those particular disagree votes.
        
        I agree with your comment to Ryan above that those who identified “Anthropic already does most of these” as “the central part of the comment” were using the disagree button as intended.
        
        The threshold for hitting the button will be different in different situations; I think the threshold many applied here was somewhat low, and a brief look at Zac’s comment history, to me, further suggests this.
        Ben Pace 17 Oct 2024 20:27 UTC
        2 points
        0
        Parent
        FTR I upvote-disagreed with the comment, in that I was glad that this dialogue was happening and yet disagreed with the comment. I think it likely I am not the only one.
      - Zac Hatfield-Dodds 22 Oct 2024 4:17 UTC
        5 points
        −3
        Parent
        
        let’s not kill the messenger, lest we run out of messengers.
        
        Unfortunately we’re a fair way into this process, not because of downvotes^[1] but rather because the comments are often dominated by uncharitable interpretations that I can’t productively engage with.^[2]. I’ve had researchers and policy people tell me that reading the discussion convinced them that engaging when their work was discussed on LessWrong wasn’t worth the trouble.
        
        I’m still here, sad that I can’t recommend it to many others, and wondering whether I’ll regret this comment too.
        
        ↩︎
        I also feel there’s a double standard, but don’t think it matters much. Span-level reacts would make it a lot easier to tell what people disagree with though.
        
        ↩︎
        Confidentiality makes any public writing far more effortful than you might expect. Comments which assume ill-faith are deeply unpleasant to engage with, and very rarely have any actionable takeaways. I’ve written and deleted a lot of other stuff here, and can’t find an object-level description that I think is worth posting, but there are plenty of further reasons.
        
        Daniel Kokotajlo 22 Oct 2024 15:24 UTC
        10 points
        0
        Parent
        Sad to hear. Is this thread itself (starting with my parent comment which you replied to) an example of this, or are you referring instead to previous engagements/threads on LW?
    - habryka 16 Oct 2024 17:59 UTC
      8 points
      14
      Parent
      I don’t understand, it seems like the thing that you disagree with is indeed the central point of the comment, so disagree voting seems appropriate? There aren’t really any substantial other parts of the comment that seem like they could cause confusion here about what is being disagreed with.
    - Daniel Kokotajlo 17 Oct 2024 11:28 UTC
      6 points
      2
      Parent
      After reading this thread and going back and forth, I think maybe this is my proposal for how to handle this sort of situation:
      The whole point of the agree/disagree vote dimension is to separate out social stuff e.g. ‘I like this guy and like his comment and want him to feel appreciated’ from epistemic stuff ‘I think the central claim in this comment is false.’ So, I think we should try our best to not discourage people from disagree-voting because they feel bad about someone getting so many disagree-votes for example.
      An alternative is to leave a comment, as you did, explaining the situation and your feelings. I think this is great.
      Another cheaper alternative is to compensate for disagree-voting with a strong-upvote instead of just an upvote.
      Finally, I wonder if there might be some experimentation to do with the statistics you can view on your personal page—e.g. maybe there should be a way to view highly upvoted but disagreed-with comments, either for yourself or across the whole site, with the framing being ‘this is one metric that helps us understand whether healthy dialogue is happening & groupthink is being avoided’
      - Zac Hatfield-Dodds 22 Oct 2024 3:57 UTC
        12 points
        10
        Parent
        I’d find the agree/disagree dimension much more useful if we split out “x people agree, y disagree”—as the EA Forum does—rather than showing the sum of weighted votes (and total number on hover).
        
        I’d also encourage people to use the other reactions more heavily, including on substrings of a comment, but there’s value in the anonymous dis/agree counts too.
  - simeon_c 16 Oct 2024 16:58 UTC
    4 points
    2
    Parent
    I’d be interested in also exploring model-spec-style aspirational documents too.
    Happy to do a call on model-spec-style aspirational documents if it’s any relevant. I think this is important and we could be interested in helping develop a template for it if Anthropic was interested in using it.

Daniel Kokotajlo comments on Daniel Kokotajlo’s Shortform

Disclosure of in-development capabilities

Disclosure of training goal /​ model spec

Public discussion of safety cases and potential risks

Whistleblower protections

Disclosure of training goal / model spec