Thanks for this response, I’m glad to see more public debate on this!
The part of Katja’s part C that I found most compelling was the argument that for a given AI system it might be in its best interests to work within the system rather than aiming to seize power. Your response argues that even if this holds true for AI systems that are only slightly superhuman, eventually we will cross a threshold where a single AI system can take over. This seems true if we hold the world fixed—there is some sufficiently capable AI system that can take over the 2022 world. But this capability threshold is a moving target: humanity will get better at aligning and controlling AI systems as we gain more experience with them, and we may be able to enlist the help of AI systems to keep others in check. So, why should we expect the equilibrium here to be an AI takeover, rather than AIs working for humans because it is in their selfish best interest in a market economy where humans are currently the primary property owners?
I think the crux here is whether we expect AI systems to collude with one another by default. They might—they have a lot in common that humans don’t, especially if they’re copies of one another! But coordination in general is hard, especially if it has to be surreptitious.
As an analogy, I could argue that for much of human history soldiers were only slightly more capable than civilians. Sure, a trained soldier with a shield and sword is a fearsome opponent, but a small group of coordinated civilians could be victorious. Yet as we developed more sophisticated weapons such as guns, cannons, and missiles, the power a single soldier wields has grown greater and greater. So, by your argument, eventually a single soldier will be powerful enough to take over the world.
This isn’t totally fanciful—the Spanish conquest of the Inca Empire started with just 168 soldiers! The Spanish fought with swords, crossbows, and lances—if the Inca Empire were still around, it seems likely that a far smaller modern military force could defeat it. Yet, clearly no single soldier is in a position to take over the world, or even a small city. Military coups d’état are the closest, but they involve convincing a significant fraction of the military that it is in their interest to seize power. Of course most soldiers wish to serve their nation, not seize power, which goes some way to explaining the relatively low rate of coup attempts. But it’s also notable that many coup attempts fail, or at least do not lead to a stable military dictatorship, precisely because of the difficulty of internal coordination. After all, if someone intends to destroy the current power structure and violate their promises, how much can you trust that they’ll really have your back if you support them?
An interesting consequence of this is that it’s ambiguous whether making AI more cooperative makes the situation better or worse.
Interesting points; I agree that our response to part C doesn’t address this well.
AIs colluding with each other is one mechanism for how things could go badly (and I do think that such collusion becomes pretty likely at some point, though I’m not sure it’s the most important crux). But I think there are other possible reasons to worry as well. One of them is a fast takeoff scenario: with fast takeoff, the “AIs take part in human societal structures indefinitely” hope seems very unlikely to me, so 1 - p(fast takeoff) puts an upper bound on how much optimism we can derive from that. It’s harder to make an airtight x-risk argument using fast takeoff, since I don’t think we have airtight arguments for p(fast takeoff) being close to 1, but it’s still important to consider if we’re figuring out our overall best guess, rather than trying to find a reasonably compact argument for AI x-risk. (To put this differently: the strongest argument for AI x-risk will of course consider all the ways in which things could go wrong, rather than just one class of ways that happens to be easiest to argue for.)
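To spell out that bound (a rough sketch, under the assumption stated above that the “work within the system” hope essentially fails given fast takeoff):

$$
P(\text{AIs keep working within human structures}) \;\le\; P(\text{slow takeoff}) \;=\; 1 - P(\text{fast takeoff})
$$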
A more robust worry (and what I’d probably rely on for a compact argument) is something like What Failure Looks Like Part 1: maybe AIs work within the system, in the sense that they don’t take over the world in obvious, visible ways. They usually don’t break laws in ways we’d notice, they don’t kill humans, etc. On paper, humans “own” the entire economy, but in practice, they have an increasingly hard time achieving outcomes they want (though they might not notice that, at least for a while). This seems like a mechanism for AIs to collectively “take over the world” (in the sense that humans don’t actually have control of the long-run trajectory of the universe anymore), even if no individual AI can break out of the system and even if AIs aren’t great at collaborating against humanity.
Addressing a few specific points:
“humanity will get better at aligning and controlling AI systems as we gain more experience with them”
True to some extent, but I’d expect AI progress to be much faster than human improvement at dealing with AI (the latter is also presumably bounded). So I think the crux is the next point:
“and we may be able to enlist the help of AI systems to keep others in check.”
Yeah, that’s an important point. I think the crux boils down to how well approaches like IDA or debate are going to work. I don’t think we currently know exactly how to make them work sufficiently well for this; I have less strong thoughts on whether they can be made to work or how difficult that would be.
I agree that in a fast takeoff scenario there’s little reason for an AI system to operate within existing societal structures, as it can outgrow them more quickly than society can adapt. I’m personally fairly skeptical of fast takeoff (<6 months, say), but quite worried that society may adapt slowly enough that even years of gradual progress, with clear signs that transformative AI is on the horizon, may be insufficient.
In terms of humans “owning” the economy but still having trouble getting what they want, it’s not obvious this is a worse outcome than the society we have today. Indeed, this feels like a pretty natural progression of human society. Humans already interact with (and not so infrequently get tricked or exploited by) entities smarter than them, such as large corporations or nation states. Yet even though I sometimes find I’ve bought a dud on the basis of canny marketing, overall I’m much better off living in a modern capitalist economy than in the Stone Age, when humans were more directly in control.
However, it does seem like there’s a lot of value lost in the scenario where humans become increasingly disempowered, even if their lives are still better than in 2022. From a total utilitarian perspective, “slightly better than 2022” and “all humans dead” are rounding errors relative to “possible future human flourishing”. But things look quite different under other ethical views, so I’m reluctant to conflate these outcomes.
I think such a natural progression could also lead to something similar to extinction (in addition to permanently curtailing humanity’s potential). E.g., maybe we are currently in a regime where optimizing proxies harder still leads to improvements to the true objective, but this could change once we optimize those proxies even more. The natural progression could follow an inverted U-shape.
E.g., take the marketing example. Maybe we will get superhuman persuasion AIs, but also AIs that protect us from persuasive ads and AIs that can provide honest reviews. It seems unclear whether these things would tend to balance out, or whether e.g. everyone will inevitably be exposed to some persuasion that causes irreparable damage. Of course, things could also work out better than expected, if our ability to keep AIs in check scales better than dangerous capabilities.
This problem of human irrelevancy seems somewhat orthogonal to the alignment problem; even a maximally aligned AI will strip humans of their agency, as it knows best. Making the AI value human agency will not be enough; humans suck enough that the other objectives will override the agency penalty most of the time, especially in important matters.
I agree that aligned AI could also make humans irrelevant, but I’m not sure how that’s related to my point. Paraphrasing what I was saying: given that AI makes humans less relevant, unaligned AI would be bad even if no single AI system can take over the world. Whether or not aligned AI would also make humans irrelevant just doesn’t seem important for that argument, but maybe I’m misunderstanding what you’re saying.