There have been a few takes so far on humans gradually losing control to AIs—not through specific systems going clearly wrong, but rather through a long-term process of increasing complexity and shifting incentives.
This sometimes gets classified as a “systemic” failure—as opposed to “misuse” and “misalignment.”
There was [“What Failure Looks Like”](https://www.alignmentforum.org/posts/HBxe6wdjxK239zajf/what-failure-looks-like), and more recently, this piece on [“Gradual Disempowerment”](https://www.lesswrong.com/posts/pZhEQieM9otKXhxmd/gradual-disempowerment-systemic-existential-risks-from).
To me, these pieces come across as highly hand-wavy, speculative, and questionable.
I get the impression that a lot of people have strong low-level assumptions here that a world with lots of strong AIs must go haywire. But I don’t see clear steps to get there or a clear model of what the critical factors are.
As I see it, there are many worlds where AIs strictly outperform humans at managing high levels of complexity and increasing coordination. In many of these, things go toward much better worlds than ones with humans in charge.
I think it’s likely that inequality could increase, but that wouldn’t mean humanity as a whole would lose control.
My gut-level guess is that there are some crucial aspects here. Like, in worlds where AI systems have strong epistemics without critical large gaps, and can generally be controlled / aligned, things will be fine. But if there are fundamental gaps for technical or political reasons, then that could lead to these “systemic” disasters.
If that is the case, I’d expect we could come up with clear benchmarks to keep track of. For example, one might say that future global well-being is highly sensitive to a factor like, “how well the typical widely-used AI service does at wisdom exam #523.”
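To make “keep track of” a bit more concrete, here’s a toy sketch of what monitoring such a factor could look like. Everything in it (the service names, the scores, the threshold) is made up for illustration:

```python
# Toy illustration: flag widely used AI services whose score on a tracked
# "wisdom" benchmark falls below an alert threshold. All names, scores,
# and the threshold itself are hypothetical.

ALERT_THRESHOLD = 0.7  # assumed minimum acceptable score on "wisdom exam #523"

def below_threshold(scores: dict[str, float]) -> list[str]:
    """Return the services whose latest benchmark score is below the threshold."""
    return [name for name, score in scores.items() if score < ALERT_THRESHOLD]

if __name__ == "__main__":
    latest_scores = {"service_a": 0.82, "service_b": 0.64, "service_c": 0.91}
    flagged = below_threshold(latest_scores)
    if flagged:
        print("Below threshold on wisdom exam #523:", ", ".join(flagged))
    else:
        print("All tracked services are above the threshold.")
```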
I think your central point is that we should clarify these scenarios, and I very much agree.
I also found those accounts important but incomplete. I wondered if the authors were assuming near-miss alignment, like AI that follows laws, or human misuse, like telling your intent-aligned AI to “go run this company according to the goals laid out in its corporate constitution,” which in practice winds up meaning “just make all the money you can.”
The first danger can be met with: for the love of god, get alignment right and don’t use an idiotic target like “follow the laws of the nation you originated in but otherwise do whatever you like.” This type of failure presupposes an entire world that has paid zero attention to the warnings from worriers that AI will keep improving and following its goals to the extreme. I don’t think we’ll sleepwalk into that scenario.
The second worry is, I guess, a variant of the first: that we’ll use intent-aligned AI very foolishly. That would be issuing a command like “follow the laws of the nation you originated in but otherwise do whatever you like.” I guess a key consideration in both cases is whether there’s an adequate level of corrigibility.
I guess I find the first scenario too foolish for even humans to fall into. Building AI with one of the exact goals people have been warning you about forever, “just make money”, is just too dumb.
But the second seems all too plausible in a world with widely proliferated intent-aligned AGI. I can see us arriving at autonomous AI/AGI with some level of intent alignment and assuming we can always go back and tell the AI to stand down, then getting complacent and discovering that it’s not really as corrigible as we hoped after it’s learned and changed its beliefs about things like “following instructions.”
> The second worry is, I guess, a variant of the first: that we’ll use intent-aligned AI very foolishly. That would be issuing a command like “follow the laws of the nation you originated in but otherwise do whatever you like.” I guess a key consideration in both cases is whether there’s an adequate level of corrigibility.
I’d flag that I suspect that we really should have AI systems forecasting the future and the results of possible requests.
So if people made a broad request like, “follow the laws of the nation you originated in but otherwise do whatever you like”, they should see forecasts of what that would lead to. If there are any clearly problematic outcomes, those should be apparent early on.
This seems like it would require either very dumb humans, or a straightforward alignment mistake risk failure, to mess up.
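To sketch the kind of gating I’m imagining (none of this maps to a real API; forecast_outcomes() is a stand-in for a capability that would need to exist):

```python
# Hypothetical sketch: before executing a broad request, ask a forecasting
# system for likely outcomes and block if any look clearly problematic.
# forecast_outcomes() is a stand-in, not a real API.

from dataclasses import dataclass

@dataclass
class Outcome:
    description: str
    probability: float          # forecasted probability of this outcome
    clearly_problematic: bool   # flagged by the forecaster as clearly bad

def forecast_outcomes(request: str) -> list[Outcome]:
    """Stand-in forecaster; a real system would query a predictive model."""
    return [
        Outcome("maximizes profit within the letter of the law", 0.6, True),
        Outcome("acts roughly as the operator intended", 0.4, False),
    ]

def gate_request(request: str, risk_threshold: float = 0.1) -> bool:
    """Surface forecasts; proceed only if no problematic outcome is likely."""
    risky = [o for o in forecast_outcomes(request)
             if o.clearly_problematic and o.probability >= risk_threshold]
    for o in risky:
        print(f"Warning: ~{o.probability:.0%} chance the AI {o.description}")
    return not risky

if __name__ == "__main__":
    request = ("follow the laws of the nation you originated in "
               "but otherwise do whatever you like")
    print("Proceed" if gate_request(request) else "Blocked pending human review")
```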
> This seems like it would require either very dumb humans, or a straightforward alignment mistake risk failure, to mess up.
I think “very dumb humans” is what we have to work with. Remember, it only requires a small number of imperfectly aligned humans to ignore the warnings (or, indeed, to welcome the world the warnings describe).
In many worlds, if we have a bunch of decently smart humans around, they would know which specific situations “very dumb humans” would mess up, and take the corresponding preventative measures.
A world where many small pockets of “highly dumb humans” could cause an existential catastrophe is one that’s very clearly incredibly fragile and dangerous, enough so that I assume reasonable actors would freak out until it stops being so fragile and dangerous. I think we see this in other areas—like cyber attacks, where reasonable people prevent small clusters of actors from causing catastrophic damage.
It’s possible that the offense/defense balance would dramatically favor tiny groups of dumb actors, and I assume that this is what you and others expect, but I don’t see it yet.
How do you propose that reasonable actors prevent reality from being fragile and dangerous?
Cyber attacks are generally based on poor protocols. Over time smart reasonable people can convince less smart reasonable people to follow better ones. Can reasonable people convince reality to follow better protocols?
As soon as you get into proposing solutions to this sort of problem, they start to look a lot less reasonable by current standards.
> a lot of people have strong low-level assumptions here that a world with lots of strong AIs must go haywire.
For myself, it seems clear that the world has ALREADY gone haywire. Individual humans have lost control of most of our lives—we interact with policies, faceless (or friendly but volition-free) workers following procedure, automated systems, etc. These systems are human-implemented, but in most cases too complex to be called human-controlled. Moloch won.
Big corporations are a form of inhuman intelligence, and their software and operations have eaten the world. AI pushes this well past a tipping point. It’s probably already irreversible without a major civilizational collapse, but it can still get … more so.
> in worlds where AI systems have strong epistemics without critical large gaps, and can generally be controlled / aligned, things will be fine.
I don’t have good working definitions of “controlled/aligned” that would make this true. I don’t see any large-scale institutions or groups large and sane enough to have a reasonable CEV, so I don’t know what an AI could align with or be controlled by.
I feel like you’re talking in highly absolutist terms here.
Global wealth is $454.4 trillion. We currently have ~8 billion humans, with an average happiness of, say, 6/10. Global wealth and most other measures of civilizational flourishing that I know of seem to be generally going up over time.
I think that our world makes a lot of mistakes and fails a lot at coordination. It’s very easy for me to imagine that we could increase global wealth by 3x if we do a decent job.
So how bad are things now? Well, approximately, “We have the current world, at $454 trillion, with 8 billion humans, etc.” To me that’s definitely something to work with.
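Spelling out that arithmetic (the 3x multiplier is just my hypothetical from above):

```python
# Rough arithmetic on the figures above; the 3x multiplier is hypothetical.
global_wealth = 454.4e12   # ~$454.4 trillion
population = 8e9           # ~8 billion people

per_capita = global_wealth / population          # ~$56,800 per person
print(f"Per-capita wealth today: ~${per_capita:,.0f}")

tripled = 3 * global_wealth                      # ~$1,363 trillion
print(f"Global wealth at 3x: ~${tripled / 1e12:,.0f} trillion")
print(f"Per-capita wealth at 3x: ~${3 * per_capita:,.0f}")
```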
> I feel like you’re talking in highly absolutist terms here.
You’re correct, and I apologize for that. There are plenty of potential good outcomes where individual autonomy reverses the trend of the last ~70 years. Or where the systemic takeover plateaus at the current level, and the main change is more wealth and options for individuals. Or where AI does in fact enable many/most individual humans to make meaningful decisions and contributions where they don’t today.
I mostly want to point out that many disempowerment/dystopia failure scenarios don’t require a step-change from AI, just an acceleration of current trends.
> I mostly want to point out that many disempowerment/dystopia failure scenarios don’t require a step-change from AI, just an acceleration of current trends.
Do you think that the world is getting worse each year?
My rough take is that humans, especially rich humans, are generally more and more successful.
I’m sure there are ways for current trends to lead to catastrophe (like some trends dramatically increasing and others decreasing), but that seems like it would require a lengthy and precise argument.
> Do you think that the world is getting worse each year?
Good clarification question! My answer probably isn’t satisfying, though. “It’s complicated” (meaning: multidimensional and not ordinally comparable).
On a lot of metrics, it’s better by far, for most of the distribution. On harder-to-operationally-define dimensions (sense of hope and agency for the 25th through 75th percentile of culturally normal people), it’s quite a bit worse.
> On harder-to-operationally-define dimensions (sense of hope and agency for the 25th through 75th percentile of culturally normal people), it’s quite a bit worse.
I think it’s likely that many people are panicking and losing hope each year. There’s a lot of grim media around.
I’m far less sold that something like “civilizational agency” is declining. From what I can tell, companies have gotten dramatically better at achieving their intended ends in the last 30 years, and most governments have generally been improving in effectiveness.
One challenge I’d have for you (and others who feel similarly) is to try to get more concrete on measures like this, and then to show that they have been declining.
My personal guess is that a bunch of people are incredibly anxious over the state of the world, largely for reasons of media attention, and then this spills over into them assuming major global ramifications without many concrete details or empirical forecasts.
> One challenge I’d have for you (and others who feel similarly) is to try to get more concrete on measures like this, and then to show that they have been declining.
I’ve given some thought to this over the last few decades, and have yet to find ANY satisfying measures, let alone a good set. I reject the trap of “if it’s not objective and quantitative, it’s not important”—that’s one of the underlying attitudes causing the decline.
I definitely acknowledge that my memory of the last quarter of the previous century is fuzzy and selective, and beyond that is secondhand and not well supported. But I also don’t deny my own experience that the people I am aware of as individuals (a tiny subset of humanity) have gotten much less hopeful and agentic over time. This may well be for reasons of media attention, but that doesn’t make it not real.