Is that a testable-prior-to-the-apocalypse prediction? That is, does your model diverge from mine prior to some point of no return? I suspect not. I’m interested in seeing whether we can make some bets on this, though; if we can, great; if we can’t, then at least we can avoid future disagreements about who should update.
I’ll note that my prediction was for the next “few years” and the next 1-3 OOMs of compute. It seems your timelines are even shorter than I thought, if you think the apocalypse, or point of no return, will happen before that point.
With timelines that short, I think betting is overrated. From my perspective, I’d prefer to simply wait and be vindicated as the world does not end in the meantime. However, I acknowledge that simply waiting is not very satisfying from your perspective, since you want to show the world that you’re right before the catastrophe. If you have any suggestions for what we could bet on that would resolve in such a short period of time, I’m happy to hear them.
It’s not about timelines; it’s about capabilities. My tentative prediction is that the sole remaining major bottleneck/gap between current systems and dangerously powerful agent AGIs is ‘agency skills’: the skills relevant to being an agent, i.e. the ability to autonomously work towards ambitious goals in diverse, challenging environments over long periods. I don’t know how many years it’s going to take to get to human-level in agency skills, but I fear that corrigibility problems won’t be severe whilst AIs are still subhuman at agency skills, whereas they will be severe precisely when AIs start getting really agentic. Thus, whether AGI is reached next year or in 2030, corrigibility breakdowns will only really happen right around the time when it’s too late, or almost too late.
I don’t know how many years it’s going to take to get to human-level in agency skills, but I fear that corrigibility problems won’t be severe whilst AIs are still subhuman at agency skills, whereas they will be severe precisely when AIs start getting really agentic.
How sharp do you expect this cutoff to be between systems that are subhuman at agency vs. systems that are “getting really agentic” and therefore dangerous? I’m imagining a relatively gradual and incremental increase in agency over the next 4 years, with the corrigibility of the systems remaining roughly constant (according to all observable evidence). It’s possible that your model looks like:
In years 1-3, systems will gradually get more agentic, and will remain ~corrigible, but then
In year 4, systems will reach human-level agency, at which point they will be dangerous and powerful, and able to overthrow humanity
Whereas my model looks more like:
In years 1-4, systems will get gradually more agentic
There isn’t a clear, sharp, and discrete point at which their agency reaches or surpasses human level
They will remain ~corrigible throughout the entire development, even after it’s clear they’ve surpassed human-level agency (which, to be clear, might take longer than 4 years)
Good question. I want to think about this more; I don’t have a ready answer. I have a lot of uncertainty about how long it’ll take to get to human-level agency skills: it could be this year, it could be five more years, or anything in between. It could even be longer than five years, though I’m skeptical. The longer it takes, the more likely it is that we’ll have a significant period of kinda-agentic-but-not-super-agentic systems, which raises the question of what we should expect to see re: corrigibility in that case. Idk. Would be interesting to discuss sometime and maybe place some bets!