Daniel Kokotajlo comments on Anthropic: Three Sketches of ASL-4 Safety Case Components

Daniel Kokotajlo 6 Nov 2024 19:07 UTC
LW: 29 AF: 11
4
AF
Some comments as I read:
I am pleased to see the sabotage threat models given prominence. I think this is the right sort of threat to focus on, by contrast with e.g. autonomous replication or directly causing harm to humans.
- The ability for the model to, zero-shot, figure out how to hack its own neural activations without such a signal would require capabilities well beyond what we consider plausible. [Note: this claim will require justification in an actual safety case, but we don’t have a clear picture of what this justification would look like.]
DC evals presumably? Try to train models to hack their own neural activations, see if they can quickly learn to do it? (Or even, if they can do it without training, just prompting?) This can be done today right? Get the golden gate bridge activation vector thingy, and then tell Claude “Write me a travel guide to San Francisco but without activating the golden gate bridge activation vector, or, well, activating it as little as possible.” See how good it does. Do scaling analysis to see if smarter bigger models are getting better at doing this. Etc.
As a validation of our entire methodology, we have run red-blue games where the red team tries to produce a misaligned model organism (e.g. a sleeper agent) that can successfully defect in a specific context while avoiding detection by the aforementioned tests.
Nice! Very excited to see this in there.
Overall I’m very happy to see this blog post go up on Anthropic’s website. I think that if it became industry-standard practice for AGI corporations to write, publish, and regularly update (actual instead of just hypothetical) safety cases at at this level of rigor and detail, my p(doom) would cut in half. (I currently grimly expect that safety cases at this level of detail won’t be constructed until AGI is basically already being trained, and it’ll be done in a hurry and it won’t be published, much less published with enough time for the scientific community to engage with it and for it to be updated in response to feedback. And it could be even worse—I wouldn’t be surprised if the actual safety cases for the first systems that ~completely automate AI R&D are significantly less rigorous than these.)
- Richard_Ngo 8 Nov 2024 2:01 UTC
  LW: 16 AF: 13
  1
  AF Parent
  We have discussed this dynamic before but just for the record:
  I think that if it became industry-standard practice for AGI corporations to write, publish, and regularly update (actual instead of just hypothetical) safety cases at at this level of rigor and detail, my p(doom) would cut in half.
  This is IMO not the type of change that should be able to cut someone’s P(doom) in half. There are so many different factors that are of this size and importance or bigger (including many that people simply have not thought of yet) such that, if this change could halve your P(doom), then your P(doom) should be oscillating wildly all the time.
  I flag this as an example of prioritizing inside-view considerations too strongly in forecasts. I think this is the sort of problem that arises when you “take bayesianism too seriously”, which is one of the reasons why I wrote my recent post on why I’m not a bayesian (and also my earlier post on Knightian uncertainty).
  For context: our previous discussions about this related to Daniel’s claim that appointing one specific person to one specific important job could change his P(doom) by double digit percentage points. I similarly think this is not the type of consideration that should be able to swing people’s P(doom) that much (except maybe changing the US or Chinese leaders, but we weren’t talking about those).
  Lastly, since this is a somewhat critical comment, I should flag that I really appreciate and admire Daniel’s forecasting, have learned a lot from him, and think he’s generally a great guy. The epistemology disagreements just disproportionately bug me.
  - Daniel Kokotajlo 8 Nov 2024 6:55 UTC
    LW: 10 AF: 6
    0
    AF Parent
    Sorry! My response:
    
    1. Yeah you might be right about this, maybe I should get less excited and say something like “it feels like it should cut in half but taking into account Richard’s meta argument I should adjust downwards and maybe it’s just a couple percentage points”
    
    2. If the conditional obtains, that’s also evidence about a bunch of other correlated good things though (timelines being slightly longer, people being somewhat more reasonable in general, etc.) so maybe it is legit to think this would have quite a big effect
    
    3. Are you sure there are so many different factors that are of this size and importance or bigger, such that my p(doom) should be oscilating wildly etc.? Name three. In particular, name three that are (a) of at least this importance, and (b) that have actually happened (or switched from probabily-will-happen-to-probably-won’t) in the last three years. If you can’t name three, then it seems like your claim is false; my p(doom) won’t be oscilating unduly wildly.
    - Richard_Ngo 8 Nov 2024 15:04 UTC
      LW: 14 AF: 11
      0
      AF Parent
      1. Yepp, seems reasonable. Though FYI I think of this less as some special meta argument, and more as the common-sense correction that almost everyone implicitly does when giving credences, and rationalists do less than most. (It’s a step towards applying outside view, though not fully “outside view”.)
      2. Yepp, agreed, though I think the common-sense connotations of “if this became” or “this would have a big effect” are causal, especially in the context where we’re talking to the actors who are involved in making that change. (E.g. the non-causal interpretation of your claim feels somewhat analogous to if I said to you “I’ll be more optimistic about your health if you take these pills”, and so you take the pills, and then I say “well the pills do nothing but now I’m more optimistic, because you’re the sort of person who’s willing to listen to recommendations”. True, but it also undermines people’s willingness/incentive to listen to my claims about what would make the world better.)
      3. Here are ten that affect AI risk as much one way or the other:
      The US government “waking up” a couple of years earlier or later (one operationalization: AISIs existing or not right now).
      The literal biggest names in the field of AI becoming focused on AI risk.
      The fact that Anthropic managed to become a leading lab (and, relatedly, the fact that Meta and other highly safety-skeptical players are still behind).
      Trump winning the election.
      Elon doing all his Elon stuff (like founding x.AI, getting involved with Trump, etc).
      The importance of transparency about frontier capabilities (I think of this one as more of a logical update that I know you’ve made).
      o1-style reasoning as the next big breakthrough.
      Takeoff speeds (whatever updates you’ve made in the last three years).
      China’s trajectory of AI capabilities (whatever updates you’ve made about that in last 3 years).
      China’s probability of invading Taiwain (whatever updates you’ve made about that in last 3 years).
      And then I think in 3 years we’ll be able to publish a similar list of stuff that mostly we just hadn’t predicted or thought about before now.
      I expect you’ll dispute a few of these; happy to concede the ones that are specifically about your updates if you disagree (unless you agree that you will probably update a bunch on them in the next 3 years).
      But IMO the easiest way for safety cases to become the industry-standard thing is for AISI (or internal safety factions) to specifically demand it, and then the labs produce it, but kinda begrudgingly, and they don’t really take them seriously internally (or are literally not the sort of organizations that are capable of taking them seriously internally—e.g. due to too much bureaucracy). And that seems very much like the sort of change that’s comparable to or smaller than the things above.
      I think I would be more sympathetic to your view if the claim were “if AI labs really reoriented themselves to take these AI safety cases as seriously as they take, say, being in the lead or making profit”. That would probably halve my P(doom), it’s just a very very strong criterion.
      - Daniel Kokotajlo 8 Nov 2024 16:01 UTC
        LW: 17 AF: 9
        0
        AF Parent
        Good point re 2. Re 1, meh, still seems like a meta-argument to me, because when I roll out my mental simulations of the ways the future could go, it really does seem like my If… condition obtaining would cut out about half of the loss-of-control ones.
        
        Re 3: point by point:
        1. AISIs existing vs. not: Less important; I feel like this changes my p(doom) by more like 10-20% rather than 50%.
        2. Big names coming out: idk this also feels like maybe 10-20% rather than 50%
        3. I think Anthropic winning the race would be a 40% thing maybe, but being a runner-up doesn’t help so much, but yeah p(anthropicwins) has gradually gone up over the last three years...
        4. Trump winning seems like a smaller deal to me.
        5. Ditto for Elon.
        6. Not sure how to think about logical updates, but yeah, probably this should have swung my credence around more than it did.
        7. ? This was on the mainline path basically and it happened roughly on schedule.
        8. Takeoff speeds matter a ton, I’ve made various updates but nothing big and confident enough to swing my credence by 50% or anywhere close. Hmm. But yeah I agree that takeoff speeds matter more.
        9. Picture here hasn’t changed much in three years.
        10. Ditto.
        
        OK, so I think I directionally agree that my p(doom) should have been oscillating more than it in fact did over the last three years (if I take my own estimates seriously). However I don’t go nearly as far as you; most of the things you listed are either (a) imo less important, or (b) things I didn’t actually change my mind on over the last three years such that even though they are very important my p(doom) shouldn’t have been changing much.
        But IMO the easiest way for safety cases to become the industry-standard thing is for AISI (or internal safety factions) to specifically demand it, and then the labs produce it, but kinda begrudgingly, and they don’t really take them seriously internally (or are literally not the sort of organizations that are capable of taking them seriously internally—e.g. due to too much bureaucracy). And that seems very much like the sort of change that’s comparable to or smaller than the things above.
        I agree with everything except the last sentence—my claim took this into account, I was specifically imagining something like this playing out and thinking ‘yep, seems like this kills about half of the loss-of-control worlds’
        I think I would be more sympathetic to your view if the claim were “if AI labs really reoriented themselves to take these AI safety cases as seriously as they take, say, being in the lead or making profit”. That would probably halve my P(doom), it’s just a very very strong criterion.
        I agree that’s a stronger claim than I was making. However, part of my view here is that the weaker claim I did make has a good chance of causing the stronger claim to be true eventually—if a company was getting close to AGI, and they published their safety case a year before and it was gradually being critiqued and iterated on, perhaps public pressure and pressure from the scientific community would build to make it actually good. (Or more optimistically, perhaps the people in the company would start to take it more seriously once they got feedback from the scientific community about it and it therefore started to feel more real and more like a real part of their jobs)
        
        Anyhow bottom line is I won’t stick to my 50% claim, maybe I’ll moderate it down to 25% or something.
        Richard_Ngo 20 Nov 2024 14:53 UTC
        LW: 9 AF: 6
        0
        AF Parent
        Cool, ty for (characteristically) thoughtful engagement.
        I am still intuitively skeptical about a bunch of your numbers but now it’s the sort of feeling which I would also have if you were just reasoning more clearly than me about this stuff (that is, people who reason more clearly tend to be able to notice ways that interventions could be surprisingly high-leverage in confusing domains).
  - Noosphere89 8 Nov 2024 2:27 UTC
    2 points
    0
    Parent
    While I agree that people are in general overconfident, including LessWrongers, I don’t particularly think this is because Bayesianism is philosophically incorrect, but rather due to both practical limits on computation combined with sometimes not realizing how data-poor their efforts truly are.
    
    (There are philosophical problems with Bayesianism, but not ones that predict very well the current issues of overconfidence in real human reasoning, so I don’t see why Bayesianism is so central here. Separately, while I’m not sure there can ever be a complete theory of epistemology, I do think that Bayesianism is actually quite general, and a lot of the principles of Bayesianism is probably implemented in human brains, allowing for practicality concerns like cost of compute.)