Interesting points; I agree that our response to part C doesn’t address this well.
AIs colluding with each other is one mechanism for how things could go badly (and I do think that such collusion becomes pretty likely at some point, though I’m not sure it’s the most important crux). But I think there are other possible reasons to worry as well. One of them is a fast takeoff scenario: with fast takeoff, the “AIs take part in human societal structures indefinitely” hope seems very unlikely to me, so 1 - p(fast takeoff) puts an upper bound on how much optimism we can derive from that. It’s harder to make an airtight x-risk argument using fast takeoff, since I don’t think we have airtight arguments for p(fast takeoff) being close to 1, but it’s still important to consider if we’re figuring out our overall best guess rather than trying to find a reasonably compact argument for AI x-risk. (To put this differently: the strongest argument for AI x-risk will of course consider all the ways in which things could go wrong, rather than just one class of ways that happens to be easiest to argue for.)
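Spelling out that bound (just to make the arithmetic explicit; here h is the hope that AIs keep taking part in human societal structures indefinitely, p is p(fast takeoff), and I’m assuming P(h | fast takeoff) ≈ 0, as argued above):

```latex
\begin{align*}
P(h) &= P(h \mid \text{fast takeoff})\,p + P(h \mid \text{no fast takeoff})\,(1-p) \\
     &\approx 0 \cdot p + P(h \mid \text{no fast takeoff})\,(1-p) \\
     &\le 1 - p.
\end{align*}
```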
A more robust worry (and what I’d probably rely on for a compact argument) is something like What Failure Looks Like Part 1: maybe AIs work within the system, in the sense that they don’t take over the world in obvious, visible ways. They usually don’t break laws in ways we’d notice, they don’t kill humans, etc. On paper, humans “own” the entire economy, but in practice they have an increasingly hard time achieving the outcomes they want (though they might not notice that, at least for a while). This seems like a mechanism for AIs to collectively “take over the world” (in the sense that humans don’t actually have control of the long-run trajectory of the universe anymore), even if no individual AI can break out of the system, and even if AIs aren’t great at collaborating against humanity.
Addressing a few specific points:
humanity will get better at aligning and controlling AI systems as we gain more experience with them,
True to some extent, but I’d expect AI progress to be much faster than human improvement at dealing with AI (and the latter is presumably bounded anyway). So I think the crux is the next point:
and we may be able to enlist the help of AI systems to keep others in check.
Yeah, that’s an important point. I think the crux boils down to how well approaches like IDA or debate are going to work. I don’t think we currently know exactly how to make them work sufficiently well for this; I have less strong views on whether they can be made to work, or how difficult that would be.
I agree that in a fast takeoff scenario there’s little reason for an AI system to operate within existing societal structures, as it can outgrow them more quickly than society can adapt. I’m personally fairly skeptical of fast takeoff (say, <6 months), but quite worried that society may be slow enough to adapt that even years of gradual progress, with clear signs that transformative AI is on the horizon, may be insufficient.
In terms of humans “owning” the economy but still having trouble getting what they want: it’s not obvious this is a worse outcome than the society we have today. Indeed, this feels like a pretty natural progression of human society. Humans already interact with (and not so infrequently get tricked or exploited by) entities smarter than them, such as large corporations or nation states. Yet even though I sometimes find I’ve bought a dud on the basis of canny marketing, overall I’m much better off living in a modern capitalist economy than in the stone age, where humans were more directly in control.
However, it does seem like there’s a lot of value lost in the scenario where humans become increasingly disempowered, even if their lives are still better than in 2022. From a total utilitarian perspective, “slightly better than 2022” and “all humans dead” are rounding errors relative to “possible future human flourishing”. But things look quite different under other ethical views, so I’m reluctant to conflate these outcomes.
I think such a natural progression could also lead to something similar to extinction (in addition to permanently curtailing humanity’s potential). E.g., maybe we are currently in a regime where optimizing proxies harder still leads to improvements to the true objective, but this could change once we optimize those proxies even more. The natural progression could follow an inverted U-shape.
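A minimal toy sketch of that inverted U (purely illustrative; the budget split and log utilities are my own assumptions, not a claim about any particular domain): a fixed effort budget is divided between a feature the proxy measures and one it doesn’t, and pushing a larger share of the budget toward the measured feature first raises and then lowers true utility.

```python
import numpy as np

# Toy model (illustrative assumptions): a fixed effort budget B is split between
# a measured feature m and an unmeasured feature u. The proxy only rewards m;
# true utility values both, with diminishing returns.
B = 10.0

def true_utility(f):
    """True utility when a fraction f of the budget goes to the measured feature."""
    m, u = f * B, (1 - f) * B
    return np.log(1 + m) + np.log(1 + u)

def proxy(f):
    """Proxy score: only the measured feature counts."""
    return f * B

for f in np.linspace(0, 1, 11):
    print(f"proxy pressure f={f:.1f}  proxy={proxy(f):5.1f}  true utility={true_utility(f):.2f}")

# The output traces an inverted U: moving from f=0 toward f=0.5 raises true
# utility, but pushing the proxy further (f -> 1) lowers it again.
```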
E.g., take the marketing example. Maybe we will get superhuman persuasion AIs, but also AIs that protect us from persuasive ads and AIs that can provide honest reviews. It seems unclear whether these things would tend to balance out, or whether e.g. everyone will inevitably be exposed to some persuasion that causes irreparable damage. Of course, things could also work out better than expected, if our ability to keep AIs in check scales better than dangerous capabilities.
This problem of human irrelevancy seems somewhat orthogonal to the alignment problem; even a maximally aligned AI will strip humans of their agency, as it knows best. Making the AI value human agency will not be enough; humans suck enough that the other objectives will override the agency penalty most of the time, especially in important matters.
I agree that aligned AI could also make humans irrelevant, but I’m not sure how that’s related to my point. Paraphrasing what I was saying: given that AI makes humans less relevant, unaligned AI would be bad even if no single AI system can take over the world. Whether or not aligned AI would also make humans irrelevant just doesn’t seem important for that argument, but maybe I’m misunderstanding what you’re saying.