Daniel Kokotajlo comments on Anthropic: Three Sketches of ASL-4 Safety Case Components

Daniel Kokotajlo 8 Nov 2024 6:55 UTC
LW: 10 AF: 6
Sorry! My response:

1. Yeah, you might be right about this; maybe I should get less excited and say something like “it feels like it should cut in half, but taking into account Richard’s meta argument I should adjust downwards, and maybe it’s just a couple of percentage points.”

2. If the conditional obtains, that’s also evidence about a bunch of other correlated good things though (timelines being slightly longer, people being somewhat more reasonable in general, etc.), so maybe it is legit to think this would have quite a big effect.

3. Are you sure there are so many different factors that are of this size and importance or bigger, such that my p(doom) should be oscillating wildly etc.? Name three. In particular, name three that are (a) of at least this importance, and (b) that have actually happened (or switched from probably-will-happen to probably-won’t) in the last three years. If you can’t name three, then it seems like your claim is false; my p(doom) won’t be oscillating unduly wildly.
- Richard_Ngo 8 Nov 2024 15:04 UTC
  LW: 14 AF: 11
  1. Yepp, seems reasonable. Though FYI I think of this less as some special meta argument, and more as the common-sense correction that almost everyone implicitly does when giving credences, and rationalists do less than most. (It’s a step towards applying outside view, though not fully “outside view”.)
  2. Yepp, agreed, though I think the common-sense connotations of “if this became” or “this would have a big effect” are causal, especially in the context where we’re talking to the actors who are involved in making that change. (E.g. the non-causal interpretation of your claim feels somewhat analogous to if I said to you “I’ll be more optimistic about your health if you take these pills”, and so you take the pills, and then I say “well the pills do nothing but now I’m more optimistic, because you’re the sort of person who’s willing to listen to recommendations”. True, but it also undermines people’s willingness/incentive to listen to my claims about what would make the world better.)
  3. Here are ten that affect AI risk as much one way or the other:
  1. The US government “waking up” a couple of years earlier or later (one operationalization: AISIs existing or not right now).
  2. The literal biggest names in the field of AI becoming focused on AI risk.
  3. The fact that Anthropic managed to become a leading lab (and, relatedly, the fact that Meta and other highly safety-skeptical players are still behind).
  4. Trump winning the election.
  5. Elon doing all his Elon stuff (like founding x.AI, getting involved with Trump, etc).
  6. The importance of transparency about frontier capabilities (I think of this one as more of a logical update that I know you’ve made).
  7. o1-style reasoning as the next big breakthrough.
  8. Takeoff speeds (whatever updates you’ve made in the last three years).
  9. China’s trajectory of AI capabilities (whatever updates you’ve made about that in the last 3 years).
  10. China’s probability of invading Taiwan (whatever updates you’ve made about that in the last 3 years).
  And then I think in 3 years we’ll be able to publish a similar list of stuff that mostly we just hadn’t predicted or thought about before now.
  I expect you’ll dispute a few of these; happy to concede the ones that are specifically about your updates if you disagree (unless you agree that you will probably update a bunch on them in the next 3 years).
  But IMO the easiest way for safety cases to become the industry-standard thing is for AISI (or internal safety factions) to specifically demand it, and then the labs produce it, but kinda begrudgingly, and they don’t really take them seriously internally (or are literally not the sort of organizations that are capable of taking them seriously internally—e.g. due to too much bureaucracy). And that seems very much like the sort of change that’s comparable to or smaller than the things above.
  I think I would be more sympathetic to your view if the claim were “if AI labs really reoriented themselves to take these AI safety cases as seriously as they take, say, being in the lead or making profit”. That would probably halve my P(doom), it’s just a very very strong criterion.
  - Daniel Kokotajlo 8 Nov 2024 16:01 UTC
    LW: 17 AF: 9
    Good point re 2. Re 1, meh, still seems like a meta-argument to me, because when I roll out my mental simulations of the ways the future could go, it really does seem like my If… condition obtaining would cut out about half of the loss-of-control ones.
    
    Re 3: point by point:
    1. AISIs existing vs. not: Less important; I feel like this changes my p(doom) by more like 10-20% rather than 50%.
    2. Big names coming out: idk, this also feels like maybe 10-20% rather than 50%.
    3. I think Anthropic winning the race would be a 40% thing maybe, but being a runner-up doesn’t help so much, but yeah p(anthropicwins) has gradually gone up over the last three years...
    4. Trump winning seems like a smaller deal to me.
    5. Ditto for Elon.
    6. Not sure how to think about logical updates, but yeah, probably this should have swung my credence around more than it did.
    7. ? This was on the mainline path basically and it happened roughly on schedule.
    8. Takeoff speeds matter a ton, I’ve made various updates but nothing big and confident enough to swing my credence by 50% or anywhere close. Hmm. But yeah I agree that takeoff speeds matter more.
    9. Picture here hasn’t changed much in three years.
    10. Ditto.
    
    OK, so I think I directionally agree that my p(doom) should have been oscillating more than it in fact did over the last three years (if I take my own estimates seriously). However, I don’t go nearly as far as you; most of the things you listed are either (a) imo less important, or (b) things I didn’t actually change my mind on over the last three years, such that even though they are very important, my p(doom) shouldn’t have been changing much.
    > But IMO the easiest way for safety cases to become the industry-standard thing is for AISI (or internal safety factions) to specifically demand it, and then the labs produce it, but kinda begrudgingly, and they don’t really take them seriously internally (or are literally not the sort of organizations that are capable of taking them seriously internally, e.g. due to too much bureaucracy). And that seems very much like the sort of change that’s comparable to or smaller than the things above.
    I agree with everything except the last sentence; my claim took this into account. I was specifically imagining something like this playing out and thinking ‘yep, seems like this kills about half of the loss-of-control worlds.’
    > I think I would be more sympathetic to your view if the claim were “if AI labs really reoriented themselves to take these AI safety cases as seriously as they take, say, being in the lead or making profit”. That would probably halve my P(doom), it’s just a very very strong criterion.
    I agree that’s a stronger claim than I was making. However, part of my view here is that the weaker claim I did make has a good chance of causing the stronger claim to be true eventually: if a company were getting close to AGI, and they had published their safety case a year before and it was gradually being critiqued and iterated on, perhaps public pressure and pressure from the scientific community would build to make it actually good. (Or, more optimistically, perhaps the people in the company would start to take it more seriously once they got feedback from the scientific community about it, and it therefore started to feel more real and more like a real part of their jobs.)
    
    Anyhow, bottom line is I won’t stick to my 50% claim; maybe I’ll moderate it down to 25% or something.
    - Richard_Ngo 20 Nov 2024 14:53 UTC
      LW: 11 AF: 8
      Cool, ty for (characteristically) thoughtful engagement.
      I am still intuitively skeptical about a bunch of your numbers but now it’s the sort of feeling which I would also have if you were just reasoning more clearly than me about this stuff (that is, people who reason more clearly tend to be able to notice ways that interventions could be surprisingly high-leverage in confusing domains).