Some rough takes on the Carlsmith Report.

Carlsmith decomposes AI x-risk into six steps, each conditional on the previous ones:
Timelines: By 2070, it will be possible and financially feasible to build APS-AI: systems with advanced capabilities (they outperform humans at tasks important for gaining power), agentic planning (they make plans and then act on them), and strategic awareness (their plans are based on models of the world good enough to overpower humans).
Incentives: There will be strong incentives to build and deploy APS-AI.
Alignment difficulty: It will be much harder to build APS-AI systems that don’t seek power in unintended ways than to build ones that would seek power but are superficially attractive to deploy.
High-impact failures: Some deployed APS-AI systems will seek power in unintended and high-impact ways, collectively causing >$1 trillion in damage.
Disempowerment: Some of the power-seeking will in aggregate permanently disempower all of humanity.
Catastrophe: The disempowerment will constitute an existential catastrophe.
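To make the conjunctive structure of this decomposition concrete, here is a minimal sketch in Python. All the numbers are placeholders I made up for illustration, not the report’s actual estimates; the point is just that the headline probability is a product of the six conditionals, and that the complement of that product is not a probability of a good outcome.

```python
# Minimal sketch of a Carlsmith-style decomposition.
# All numbers are illustrative placeholders, not the report's estimates.
conditionals = {
    "timelines": 0.65,             # P(APS-AI possible & financially feasible by 2070)
    "incentives": 0.80,            # P(strong incentives to build/deploy | previous)
    "alignment_difficulty": 0.40,  # P(misaligned systems much easier to build | previous)
    "high_impact_failures": 0.65,  # P(>$1T damage from unintended power-seeking | previous)
    "disempowerment": 0.40,        # P(permanent disempowerment of humanity | previous)
    "catastrophe": 0.95,           # P(disempowerment is an existential catastrophe | previous)
}

p_doom = 1.0
for step, p in conditionals.items():
    p_doom *= p

print(f"P(existential catastrophe via this path) ~ {p_doom:.3f}")

# The complement is a grab-bag, not "P(good outcome)": it contains every world
# where at least one conjunct fails, including worlds where alignment is easy
# but misuse or value lock-in still wastes most of the future.
print(f"1 - P(doom) ~ {1 - p_doom:.3f}  # not the same thing as P(good outcome)")
```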
These steps define a tree over possibilities. But the associated outcome buckets don’t feel very reality-carving to me. A recurring crux is that good outcomes are also highly conjunctive, i.e., one of these six conditions failing does not by itself give a good AI outcome. Going through piece by piece:
Timelines makes sense and seems like a good criterion; everything else is downstream of timelines.
Incentives seems weird. What does a world in which there are no incentives to deploy APS-AI look like? There are already a bunch of incentives that clearly push people toward this: status, desire for scientific discovery, power, money. Moreover, this condition doesn’t seem necessary for AI x-risk: even if we somehow removed the gigantic incentives to build APS-AI that we know exist, people might still deploy APS-AI because they personally wanted to, even without social incentives to do so.
Alignment difficulty is another condition that isn’t necessary. Some ways of getting x-risk without alignment being very hard:
(A) For one, alignment difficulty is clearly a spectrum, and even if it is on the really low end of the spectrum, perhaps you only need a small amount of extra compute overhead to robustly align your system. One of the RAAP stories might then occur: even though technical alignment is pretty easy, the companies that spend that extra compute robustly aligning their AIs gradually lose out to other companies in the competitive marketplace.
(B) Maybe alignment is easy, but someone misuses AI, say to create an AI-assisted dictatorship.
(C) Maybe we try really hard and can align AI to whatever we want, but we make a bad choice and lock in current-day values, or we pick a bad reflection procedure that gives us much less than the ideal value of the universe.
High-impact failures contains much of the structure, at least in my eyes. The main ways we avoid alignment failure are worlds where something happens to take us off the default trajectory:
Perhaps we make a robust coordination agreement between labs/countries that causes people to avoid deploying until they’ve solved alignment.
Perhaps we solve alignment and harden the world in some way, e.g. by removing compute access, dramatically improving cybersecurity, and monitoring and shutting down dangerous training runs.
In general, thinking about the likelihood that any of these interventions works feels very important.
Disempowerment. This and (4) are very entangled with upstream things like takeoff shape. Also, it feels extremely difficult for humanity not to be disempowered.
Catastrophe. To avoid this, again, I need to imagine the extra structure upstream of it, e.g. (4) was satisfied by a warning shot, and then people coordinated and deployed a benign sovereign that disempowered humanity for good reasons.
My current preferred way to think about the likelihood of AI risk routes through something like this framework, but it is more structured and has a tree with more conjuncts towards success as well as doom.
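As a toy illustration of the shape I have in mind, here is a sketch where good outcomes, like bad ones, sit at the end of their own conjunction of branches. Every question, bucket name, and probability below is a made-up placeholder, not an actual estimate.

```python
# Toy outcome tree: internal nodes are (question, P(yes), yes_branch, no_branch),
# leaves are outcome buckets. All numbers and labels are made-up placeholders.
tree = (
    "APS-AI feasible and built?", 0.7,
    (
        "coordination delays deployment until alignment is solved?", 0.2,
        ("values / reflection procedure chosen well?", 0.5,
         "flourishing", "locked-in mediocrity"),
        (
            "alignment turns out easy enough in practice?", 0.3,
            ("misuse (e.g. AI-assisted dictatorship) averted?", 0.6,
             "flourishing", "stable dictatorship"),
            ("takeover averted despite misalignment?", 0.1,
             "muddling through", "existential catastrophe"),
        ),
    ),
    "status quo continues (no APS-AI)",
)

def leaf_probs(node, p=1.0, out=None):
    """Accumulate the probability mass landing in each leaf bucket."""
    if out is None:
        out = {}
    if isinstance(node, str):  # leaf: an outcome bucket
        out[node] = out.get(node, 0.0) + p
        return out
    _question, p_yes, yes_branch, no_branch = node
    leaf_probs(yes_branch, p * p_yes, out)
    leaf_probs(no_branch, p * (1.0 - p_yes), out)
    return out

for bucket, prob in sorted(leaf_probs(tree).items(), key=lambda kv: -kv[1]):
    print(f"{bucket}: {prob:.3f}")
```

The point is just the shape: “flourishing” only appears at the end of its own conjunction of branches, so negating any single doom conjunct mostly shifts probability mass into other not-great buckets rather than into success.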
(B) Maybe alignment is easy, but someone misuses AI, say to create an AI-assisted dictatorship.
(C) Maybe we try really hard and can align AI to whatever we want, but we make a bad choice and lock in current-day values, or we pick a bad reflection procedure that gives us much less than the ideal value of the universe.
I want to focus on these two, since they can still happen even in an AI alignment success story, and thus they don’t count as AI alignment failures.
For B, I want to note that “misuse” is relative to someone’s values.
For C, I view the idea of a “bad value” or a “bad reflection procedure over values”, without asking “relative to what and whose values?”, as a type error; it’s not sensible to talk about bad values or bad reflection procedures in isolation.