I sometimes want to point people towards a very short, clear summary of What failure looks like, which doesn’t seem to exist, so here’s my attempt.
Many agentic AI systems gradually increase in intelligence and generality, and are deployed increasingly widely across society to do important tasks (e.g., law enforcement, running companies, manufacturing and logistics).
Initially, this world looks great from a human perspective, and most people are much richer than they are today.
But things then go badly in one of two ways (or more likely, a combination of both).
[Part 1] Going out with a whimper
In the training process, we used easily-measurable proxy goals as objective functions, that don’t push the AI systems to do what we actually want e.g.
‘maximise positive feedback from your operator’ instead of ‘try to help your operator get what they actually want’
‘reduce reported crimes’ instead of ‘actually prevent crime’
‘increase reported life satisfaction’ instead of ‘actually help humans live good lives’
‘increasing human wealth on paper’ instead of ‘increasing effective human control over resources’
(We did this because ML needs lots of data/feedback to train systems, and you can collect much more data/feedback on easily-measurable objectives.)
Due to competitive pressures, systems continue being deployed despite some people pointing out this is a bad idea.
The goals of AI systems gradually gain more influence over the future relative to human goals.
Eventually, the proxies for which the AI systems are optimising come apart from the goals we truly care about, but by then humanity won’t be able to take back influence, and we’ll have permanently lost some of our ability to steer our trajectory. In the end, we will either go extinct or be mostly disempowered.
(In some sense, this isn’t really a big departure from what is already happening today—just imagine replacing today’s powerful corporations and states with machines pursuing similar objectives).
[Part 2] Going out with a bang
These AI systems end up learning objectives that are unrelated to the objective functions used in the training process, because the objective they ended up learning was more naturally discovered during the training process (e.g. “don’t get shut down”).
The systems seek influence as an instrumental subgoal (since with more influence, a system is more likely to be able to e.g. prevent attempts to shut it down).
Early in training, the best way to do that is by being obedient (since systems understand that unobedient behaviour would get them shut down).
Then, once the systems become sufficiently capable, they attempt to acquire resources and influence to more effectively achieve their goals, including by eliminating the influence of humans. In the end, humans will most likely go extinct, because the systems have no incentive to preserve our survival.
This is a bit misleading in that the scale of the disaster is more apparent in the regime that takes place some time after this story, when the AI systems are disassembling the Solar System. At that point, humanity would only remain if it’s intentionally maintained, so speaking of “our trajectory” that’s not being steered by our will is too optimistic. And since by then there’s tech that can reconstruct humanity from data, there is even less point in keeping us online than there’s currently for gorillas, it’s feasible to just archive and forget.
Good catch, I edited the last points in each part to make the scale of the disaster clearer, and removed the reference to gorillas.
I do think the scale of disaster is smaller (in expectation) in Part 1 than in Part 2, for the reason mentioned here—basically, the systems in Part 1 are somewhat more aligned with human intentions (albeit poorly specified proxies to them), so there’s some chance that they leave humans alone. Whereas Part 2 is a treacherous turn inner alignment failure, where the systems learned arbitrary objectives and so have no incentive at all to keep humans alive.
My own guess is that this is not that far-fetched. This is a “generic values hypothesis”, that human values are enough of a blank slate thing that the Internet already redundantly imprints everything relevant that humans share. In this case a random AI with values that are vaguely learning-from-Internet inspired is not much less aligned than a random human, and although that’s not particularly reassuring (value drift can go far when minds are upgraded without a clear architecture that formulates and preserves values), this is a reason for some nontrivial chance of settling on a humane attitude to humanity, which wouldn’t just happen on its own, without cause. This possibility gets more remote if values are engineered de novo and don’t start out as channeling a language model.
humans will most likely go extinct
Without the argument this feels alarmist. Humans can manage their own survival if they are not actively exterminated, it takes a massive disruption as a byproduct of AIs’ activities to prevent that. The possibility of such a disruption is grounded in the convergent instrumental value of resource acquisition and eventual feasibility of megascale engineering, the premises that are not necessarily readily apparent.
My own guess is that this is not that far-fetched.
Thanks for writing this out, I found it helpful and it’s updated me a bit towards human extinction not being that far-fetched in the ‘Part 1’ world. Though I do still think that, in this world, humans would almost certaintly have very little chance of ever gaining control over our future/trajectory.
Without the argument this feels alarmist
Let me try to spell out the argument a little more—I think my original post was a little unclear. I don’t think the argument actually appeals to the “convergent instrumental value of resource acquisition”. We’re not talking about randomly sampling an objective function for AGI and asking whether it implies resource acquisition for intstrumental reasons.
Rather, we’re talking about selecting an objective function for AGI using something like gradient descent on some training objective, and—instead of an aligned objective arising from this process—a resource-acquiring/influence-seeking objective emerges. This is because doing well on the training objective is a good strategy for gaining resources/influence.
Random objectives that aren’t resource/influence-seeking will be selected against by the training process, because they don’t perform well on the training objective.
On this model, the AGI will have a resource-acquiring objective function, and we don’t need to appeal to the convergent instrumental value of resource acquisition.
I’m curious if this distinction makes sense and seems right to you?
Maybe. But that depends on what exactly are the terminal resource-seeking objectives, it’s not clear that in this story they would go far enough to directly talk of dismantling whole planets. On the other hand, dismantling whole planets is instrumentally useful for running experiments into the details of fundamental physics or building planet-sized computers or weapons against possible aliens, all to ensure that the objective of gathering strawberries on a particular (small, well-defined) farm proceeds without fail.
I sometimes want to point people towards a very short, clear summary of What failure looks like, which doesn’t seem to exist, so here’s my attempt.
Many agentic AI systems gradually increase in intelligence and generality, and are deployed increasingly widely across society to do important tasks (e.g., law enforcement, running companies, manufacturing and logistics).
Initially, this world looks great from a human perspective, and most people are much richer than they are today.
But things then go badly in one of two ways (or more likely, a combination of both).
[Part 1] Going out with a whimper
In the training process, we used easily-measurable proxy goals as objective functions, that don’t push the AI systems to do what we actually want e.g.
‘maximise positive feedback from your operator’ instead of ‘try to help your operator get what they actually want’
‘reduce reported crimes’ instead of ‘actually prevent crime’
‘increase reported life satisfaction’ instead of ‘actually help humans live good lives’
‘increasing human wealth on paper’ instead of ‘increasing effective human control over resources’
(We did this because ML needs lots of data/feedback to train systems, and you can collect much more data/feedback on easily-measurable objectives.)
Due to competitive pressures, systems continue being deployed despite some people pointing out this is a bad idea.
The goals of AI systems gradually gain more influence over the future relative to human goals.
Eventually, the proxies for which the AI systems are optimising come apart from the goals we truly care about, but by then humanity won’t be able to take back influence, and we’ll have permanently lost some of our ability to steer our trajectory. In the end, we will either go extinct or be mostly disempowered.
(In some sense, this isn’t really a big departure from what is already happening today—just imagine replacing today’s powerful corporations and states with machines pursuing similar objectives).
[Part 2] Going out with a bang
These AI systems end up learning objectives that are unrelated to the objective functions used in the training process, because the objective they ended up learning was more naturally discovered during the training process (e.g. “don’t get shut down”).
The systems seek influence as an instrumental subgoal (since with more influence, a system is more likely to be able to e.g. prevent attempts to shut it down).
Early in training, the best way to do that is by being obedient (since systems understand that unobedient behaviour would get them shut down).
Then, once the systems become sufficiently capable, they attempt to acquire resources and influence to more effectively achieve their goals, including by eliminating the influence of humans. In the end, humans will most likely go extinct, because the systems have no incentive to preserve our survival.
This is a bit misleading in that the scale of the disaster is more apparent in the regime that takes place some time after this story, when the AI systems are disassembling the Solar System. At that point, humanity would only remain if it’s intentionally maintained, so speaking of “our trajectory” that’s not being steered by our will is too optimistic. And since by then there’s tech that can reconstruct humanity from data, there is even less point in keeping us online than there’s currently for gorillas, it’s feasible to just archive and forget.
Good catch, I edited the last points in each part to make the scale of the disaster clearer, and removed the reference to gorillas.
I do think the scale of disaster is smaller (in expectation) in Part 1 than in Part 2, for the reason mentioned here—basically, the systems in Part 1 are somewhat more aligned with human intentions (albeit poorly specified proxies to them), so there’s some chance that they leave humans alone. Whereas Part 2 is a treacherous turn inner alignment failure, where the systems learned arbitrary objectives and so have no incentive at all to keep humans alive.
My own guess is that this is not that far-fetched. This is a “generic values hypothesis”, that human values are enough of a blank slate thing that the Internet already redundantly imprints everything relevant that humans share. In this case a random AI with values that are vaguely learning-from-Internet inspired is not much less aligned than a random human, and although that’s not particularly reassuring (value drift can go far when minds are upgraded without a clear architecture that formulates and preserves values), this is a reason for some nontrivial chance of settling on a humane attitude to humanity, which wouldn’t just happen on its own, without cause. This possibility gets more remote if values are engineered de novo and don’t start out as channeling a language model.
Without the argument this feels alarmist. Humans can manage their own survival if they are not actively exterminated, it takes a massive disruption as a byproduct of AIs’ activities to prevent that. The possibility of such a disruption is grounded in the convergent instrumental value of resource acquisition and eventual feasibility of megascale engineering, the premises that are not necessarily readily apparent.
Thanks for writing this out, I found it helpful and it’s updated me a bit towards human extinction not being that far-fetched in the ‘Part 1’ world. Though I do still think that, in this world, humans would almost certaintly have very little chance of ever gaining control over our future/trajectory.
Let me try to spell out the argument a little more—I think my original post was a little unclear. I don’t think the argument actually appeals to the “convergent instrumental value of resource acquisition”. We’re not talking about randomly sampling an objective function for AGI and asking whether it implies resource acquisition for intstrumental reasons.
Rather, we’re talking about selecting an objective function for AGI using something like gradient descent on some training objective, and—instead of an aligned objective arising from this process—a resource-acquiring/influence-seeking objective emerges. This is because doing well on the training objective is a good strategy for gaining resources/influence.
Random objectives that aren’t resource/influence-seeking will be selected against by the training process, because they don’t perform well on the training objective.
On this model, the AGI will have a resource-acquiring objective function, and we don’t need to appeal to the convergent instrumental value of resource acquisition.
I’m curious if this distinction makes sense and seems right to you?
Maybe. But that depends on what exactly are the terminal resource-seeking objectives, it’s not clear that in this story they would go far enough to directly talk of dismantling whole planets. On the other hand, dismantling whole planets is instrumentally useful for running experiments into the details of fundamental physics or building planet-sized computers or weapons against possible aliens, all to ensure that the objective of gathering strawberries on a particular (small, well-defined) farm proceeds without fail.