I’ve been thinking of writing up a piece on the implications of very short timelines, in light of various people recently suggesting them (eg Dario Amodei, “2026 or 2027...there could be a mild delay”)
Here’s a thought experiment: suppose that this week it turns out that OAI has found a modified sampling technique for o1 that puts it at the level of the median OAI capabilities researcher, in a fairly across-the-board way (ie it’s just straightforwardly able to do the job of a researcher). Suppose further that it’s not a significant additional compute expense; let’s say that OAI can immediately deploy a million instances.
What outcome would you expect? Let’s operationalize that as: what do you think is the chance that we get through the next decade without AI causing a billion deaths (via misuse or unwanted autonomous behaviors or multi-agent catastrophes that are clearly downstream of those million human-level AI)?
In short, what do you think are the chances that that doesn’t end disastrously?
Depends what they do with it. If they use it to do the natural and obvious capabilities research, like they currently are (mixed with a little hodgepodge alignment to keep it roughly on track), I think we just basically for sure die. If they pivot hard to solving alignment in a very different paradigm and... no, this hypothetical doesn’t imply the AI can discover or switch to other paradigms.
I think doom is almost certain in this scenario.
If we could trust OpenAI to handle this scenario responsibly, our odds would definitely seem better to me.
I’d say that we’d have a 70-80% chance of getting through the next decade without AI causing a billion deaths, if powerful AI like this comes.
I wish I shared your optimism! You’ve talked about some of your reasons for it elsewhere, but I’d be interested to hear even a quick sketch of roughly how you imagine the next decade to go in the context of the thought experiment, in the 70-80% of cases where you expect things to go well.
The next decade, from 2026 to 2036, will probably be wild, conditional on your scenario starting to come to pass, and my guess is that robotics is solved 2-5 years after the new AI is introduced.
But to briefly talk about the 70-80% of worlds where we make it through, several common properties appear:
Data still matters a great deal for capabilities and alignment, and the sparse RL problem, where you try to get an AI to do something based on very little data, will essentially not contribute to capabilities for the next several decades, if ever. (I’m defining it as the regime where goals play out over, say, 1-10 year timescales, or maybe even just 1-year timescales, with no reward shaping or feedback on intermediate rewards at all.)
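To make the sparse-vs-shaped distinction concrete, here’s a minimal toy sketch in Python (the episode structure and the `distance_to_goal` signal are made up for illustration, not any lab’s actual setup): a sparse reward gives no signal until the long-horizon goal is hit, while a shaped reward grades intermediate progress every step, which is the regime current RL actually handles well.

```python
# Toy illustration only: a "year-long goal" compressed into one function call.
# `state`, `goal_reached`, and `distance_to_goal` are hypothetical placeholders.

def sparse_reward(state, goal_reached: bool) -> float:
    """Sparse setting: no feedback until the final goal is achieved.

    Over a 1-10 year horizon this means the learner gets essentially
    zero training signal for almost the entire episode.
    """
    return 1.0 if goal_reached else 0.0


def shaped_reward(state, goal_reached: bool, distance_to_goal: float) -> float:
    """Shaped setting: intermediate progress is graded at every step,
    giving a dense signal the learner can actually follow."""
    progress_bonus = -0.01 * distance_to_goal  # small per-step feedback
    return (1.0 if goal_reached else 0.0) + progress_bonus
```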
Unlearning becomes more effective, such that we can remove certain capabilities without damaging the rest of the system, and this technique is pretty illustrative:
https://x.com/scaling01/status/1865200522581418298
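(I can’t speak for the specifics of the linked result, but as a rough sketch of what “remove a capability without damaging the rest of the system” tends to look like mechanically, here’s a toy gradient-ascent-style unlearning update. The model, the “forget” batch of capability-specific data, and the “retain” batch of everything-else data are all placeholders; real methods are considerably more careful.)

```python
def unlearning_step(model, optimizer, forget_batch, retain_batch, loss_fn, alpha=1.0):
    """One toy unlearning update (PyTorch-style interfaces assumed):
    push loss *up* on the capability we want removed (forget set)
    while keeping loss *down* on everything else (retain set)."""
    optimizer.zero_grad()

    # Ascend on the forget set: make the model worse at the target capability.
    forget_loss = loss_fn(model(forget_batch["inputs"]), forget_batch["labels"])

    # Descend on the retain set: preserve general performance.
    retain_loss = loss_fn(model(retain_batch["inputs"]), retain_batch["labels"])

    total = -alpha * forget_loss + retain_loss
    total.backward()
    optimizer.step()
    return forget_loss.item(), retain_loss.item()
```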
AI control becomes a bigger deal in labs, giving them much more ability to prevent self-exfiltration.
I like Mark Xu’s arguments, and if I’m wrong about alignment being easy, AI control would be more important for safety.
https://www.lesswrong.com/posts/A79wykDjr4pcYy9K7/mark-xu-s-shortform#FGf6reY3CotGh4ewv
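(For anyone who hasn’t read the control literature: a very stripped-down version of the trusted-monitoring idea is that an untrusted model’s proposed actions get screened by a weaker trusted model and escalated to a human when they look suspicious, before anything can execute. Every name and threshold in the sketch below is hypothetical.)

```python
SUSPICION_THRESHOLD = 0.7  # hypothetical tuning parameter

def run_with_control(task: str, untrusted_model, trusted_monitor, human_review) -> str:
    """Toy control loop: the untrusted model proposes an action, a trusted
    (weaker) monitor scores how suspicious it looks, and anything above the
    threshold is routed to a human auditor instead of being executed."""
    proposed_action = untrusted_model(task)
    suspicion = trusted_monitor(task, proposed_action)  # float in [0, 1]

    if suspicion > SUSPICION_THRESHOLD:
        # Escalate: don't execute, hand it to a human for review.
        return human_review(task, proposed_action)
    return proposed_action
```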
As for my sketch of how the world goes in the median future: conditional on them achieving something like a research AI in 2026, they first automate their own research, which will take 1-5 years, then solve robotics, which will take another 2-5 years, and by 2036 the economy starts seriously feeling the impact of an AI that can replace everyone’s jobs.
The big reason why this change is slower than a lot of median predictions is a combination of AI research being more decoupled from the rest of the economy than most other fields, and the problems being solvable but with a lot of edge cases that will take time to iron out (similar to how self-driving cars went from being very bad in the 2000s to actually working in 2021-2023).
The big question is if distributed training works out.
Thanks for sketching that out, I appreciate it. Unlearning significantly improving the safety outlook is something I may not have fully priced in.
My guess is that the central place we differ is that I expect dropping in, say, 100k extra capabilities researchers gets us into greater-than-human intelligence fairly quickly—we’re already seeing LLMs scoring better than human in various areas, so clearly there’s no hard barrier at human level—and at that point control gets extremely difficult.
I do certainly agree that there’s a lot of low-hanging fruit in control that’s well worth grabbing.
I realize that asking about p(doom) is utterly 2023, but I’m interested to see if there’s a rough consensus in the community about how it would go if this happened now, and then it’s possible to consider how that shifts as the assumed date moves further out.
We already have enough AI to cause a billion deaths in the next decade via mass production of AI drones, robotic armies, and AI-empowered strategic planners. No new capabilities are needed.
Granted—but I think the chances of that happening are different in my proposed scenario than currently.