Indeed, I find it somewhat notable that high-level arguments for AI risk rarely attend in detail to the specific structure of an AI’s motivational system, or to the sorts of detailed trade-offs a not-yet-arbitrarily-powerful-AI might face in deciding whether to engage in a given sort of problematic power-seeking. [...] I think my power-seeking report is somewhat guilty in this respect; I tried, in my report on scheming, to do better.
Your 2021 report on power-seeking does not appear to discuss the cost-benefit analysis that a misaligned AI would conduct when considering takeover, or the likelihood that this cost-benefit analysis might not favor takeover. Other people have been pointing that out for a long time, and in this post, it seems you’ve come around on that argument and added some details to it.
It’s admirable that you’ve changed your mind in response to new ideas, and it takes a lot of courage to publicly own mistakes. But given the tremendous influence of your report on power-seeking, I think it’s worth reflecting more on your update that one of its core arguments may have been incorrect or incomplete.
Most centrally, I’d like to point out that several people have already made versions of the argument presented in this post. Some of them have been directly criticizing your 2021 report on power-seeking. You haven’t cited any of them here, but I think it would be worthwhile to recognize their contributions:
Dmitri Gallow
2023: “Were I to be a billionaire, this might help me pursue my ends. But I’m not at all likely to try to become a billionaire, since I don’t value the wealth more than the time it would take to secure the wealth—to say nothing about the probability of failure. In general, whether it’s rational to pursue something is going to depend upon the costs and benefits of the pursuit, as well as the probabilities of success and failure, the costs of failure, and so on”
David Thorstad
2023, about the report: “It is important to separate Likelihood of Goal Satisfaction (LGS) from Goal Pursuit (GP). For suitably sophisticated agents, (LGS) is a nearly trivial claim.
Most agents, including humans, superhumans, toddlers, and toads, would be in a better position to achieve their goals if they had more power and resources under their control… From the fact that wresting power from humanity would help a human, toddler, superhuman or toad to achieve some of their goals, it does not yet follow that the agent is disposed to actually try to disempower all of humanity.
It would therefore be disappointing, to say the least, if Carlsmith were to primarily argue for (LGS) rather than for (ICC-3). However, that appears to be what Carlsmith does...
What we need is an argument that artificial agents for whom power would be useful, and who are aware of this fact, are likely to go on to seek enough power to disempower all of humanity. And so far we have literally not seen an argument for this claim.”
Matthew Barnett
January 2024: “Even if a unified agent can take over the world, it is unlikely to be in their best interest to try to do so. The central argument here would be premised on a model of rational agency, in which an agent tries to maximize benefits minus costs, subject to constraints. The agent would be faced with a choice: (1) Attempt to take over the world, and steal everyone’s stuff, or (2) Work within a system of compromise, trade, and law, and get very rich within that system, in order to e.g. buy lots of paperclips. The question of whether (1) is a better choice than (2) is not simply a question of whether taking over the world is “easy” or whether it could be done by the agent. Instead it is a question of whether the benefits of (1) outweigh the costs, relative to choice (2).”
April 2024: “Skepticism of the treacherous turn: The treacherous turn is the idea that (1) at some point there will be a very smart unaligned AI, (2) when weak, this AI will pretend to be nice, but (3) when sufficiently strong, this AI will turn on humanity by taking over the world by surprise, and then (4) optimize the universe without constraint, which would be very bad for humans.
By comparison, I find it more likely that no individual AI will ever be strong enough to take over the world, in the sense of overthrowing the world’s existing institutions and governments by surprise. Instead, I broadly expect unaligned AIs will integrate into society and try to accomplish their goals by advocating for their legal rights, rather than trying to overthrow our institutions by force. Upon attaining legal personhood, unaligned AIs can utilize their legal rights to achieve their objectives, for example by getting a job and trading their labor for property, within the already-existing institutions. Because the world is not zero sum, and there are economic benefits to scale and specialization, this argument implies that unaligned AIs may well have a net-positive effect on humans, as they could trade with us, producing value in exchange for our own property and services.”
There are important differences between their arguments and yours, such as your focus on the ease of takeover as the key factor in the cost-benefit analysis. But one central argument is the same: in your words, “even for an AI system that estimates some reasonable probability of success at takeover if it goes for it, the strategic calculus may be substantially more complex.”
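To make the shared cost-benefit framing concrete, here is a minimal sketch of the comparison Barnett and the post both describe. Everything in it is hypothetical: the function names, numbers, and thresholds are chosen purely for illustration and are not drawn from either source. The point is just that an agent attempts takeover only if the expected value of the attempt, discounted by the probability and cost of failure, exceeds what it could get by working within the existing system.

```python
# Toy illustration only: all numbers and names below are hypothetical,
# not taken from the post or the report.

def expected_value_of_takeover(p_success, value_if_success, cost_if_failure):
    """Expected value of attempting takeover: gain value_if_success with
    probability p_success, otherwise pay cost_if_failure."""
    return p_success * value_if_success - (1 - p_success) * cost_if_failure


def takeover_is_rational(p_success, value_if_success, cost_if_failure,
                         value_of_cooperation):
    """Takeover only beats working within the existing system if its
    expected value exceeds the value of the cooperative alternative."""
    return expected_value_of_takeover(
        p_success, value_if_success, cost_if_failure
    ) > value_of_cooperation


# A 30% chance of success with a large downside to failure and a decent
# cooperative alternative: EV(takeover) = 0.3*100 - 0.7*80 = -26 < 40.
print(takeover_is_rational(0.3, 100.0, 80.0, 40.0))  # False
```

Even a substantial probability of success does not make the attempt rational when failure is costly and the cooperative alternative is good enough; that is the sense in which the strategic calculus depends on more than whether takeover is possible.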
Why am I pointing this out? Because I think it’s worth keeping track of who’s been right and who’s been wrong in longstanding intellectual debates. Yudkowsky was wrong about takeoff speeds, and Paul was right. Bostrom was wrong about the difficulty of value specification. Given that most people cannot evaluate most debates on the object level (especially debates involving hundreds of pages written by people with PhDs in philosophy), it serves a genuinely useful epistemic function to pay attention to the intellectual track records of people and communities.
Two potential updates here:
On the value of external academic criticism in refining key arguments in the AI risk debate.
On the likelihood that long-held and widespread beliefs in the AI risk community are incorrect.
“Your 2021 report on power-seeking does not appear to discuss the cost-benefit analysis that a misaligned AI would conduct when considering takeover, or the likelihood that this cost-benefit analysis might not favor takeover.”
I don’t think this is quite right. For example: Section 4.3.3 of the report, “Controlling circumstances,” focuses on the possibility of ensuring that an AI’s environmental constraints are such that the cost-benefit calculus does not favor problematic power-seeking. Quoting:
So far in section 4.3, I’ve been talking about controlling “internal” properties of an APS system: namely, its objectives and capabilities. But we can control external circumstances, too—and in particular, the type of options and incentives a system faces.
Controlling options means controlling what a circumstance makes it possible for a system to do, even if it tried. Thus, using a computer without internet access might prevent certain types of hacking; a factory robot may not be able to access the outside world; and so forth.
Controlling incentives, by contrast, means controlling which options it makes sense to choose, given some set of objectives. Thus, perhaps an AI system could impersonate a human, or lie; but if it knows that it will be caught, and that being caught would be costly to its objectives, it might refrain. Or perhaps a system will receive more of a certain kind of reward for cooperating with humans, even though options for misaligned power-seeking are open.
Human society relies heavily on controlling the options and incentives of agents with imperfectly aligned objectives. Thus: suppose I seek money for myself, and Bob seeks money for Bob. This need not be a problem when I hire Bob as a contractor. Rather: I pay him for his work; I don’t give him access to the company bank account; and various social and legal factors reduce his incentives to try to steal from me, even if he could.
A variety of similar strategies will plausibly be available and important with APS systems, too. Note, though, that Bob’s capabilities matter a lot, here. If he was better at hacking, my efforts to avoid giving him the option of accessing the company bank account might (unbeknownst to me) fail. If he was better at avoiding detection, his incentives not to steal might change; and so forth. PS-alignment strategies that rely on controlling options and incentives therefore require ways of exerting this control (e.g., mechanisms of security, monitoring, enforcement, etc) that scale with the capabilities of frontier APS systems. Note, though, that we need not rely solely on human abilities in this respect. For example, we might be able to use various non-APS systems and/or practically-aligned APS systems to help.
See also the discussion of myopia in 4.3.1.3:

The most paradigmatically dangerous types of AI systems plan strategically in pursuit of long-term objectives, since longer time horizons leave more time to gain and use forms of power humans aren’t making readily available, and they more easily justify strategic but temporarily costly action (for example, trying to appear adequately aligned, in order to get deployed) aimed at such power. Myopic agentic planners, by contrast, are on a much tighter schedule, and they have consequently weaker incentives to attempt forms of misaligned deception, resource-acquisition, etc. that only pay off in the long run (though even short spans of time can be enough to do a lot of harm, especially for extremely capable systems—and the timespans “short enough to be safe” can alter if what one can do in a given span of time changes).
And of “controlling capabilities” in section 4.3.2:

Less capable systems will also have a harder time getting and keeping power, and a harder time making use of it, so they will have stronger incentives to cooperate with humans (rather than trying to e.g. deceive or overpower them), and to make do with the power and opportunities that humans provide them by default.
I also discuss the cost-benefit dynamic in the section on instrumental convergence (including discussion of trying-to-make-a-billion-dollars as an example), and point people to section 4.3 for more discussion.
I think there is an important point in this vicinity: namely, that power-seeking behavior, in practice, arises not just due to strategically-aware agentic planning, but due to the specific interaction between an agent’s capabilities, objectives, and circumstances. But I don’t think this undermines the posited instrumental connection between strategically-aware agentic planning and power-seeking in general. Humans may not seek various types of power in their current circumstances—in which, for example, their capabilities are roughly similar to those of their peers, they are subject to various social/legal incentives and physical/temporal constraints, and in which many forms of power-seeking would violate ethical constraints they treat as intrinsically important. But almost all humans will seek to gain and maintain various types of power in some circumstances, and especially to the extent they have the capabilities and opportunities to get, use, and maintain that power with comparatively little cost. Thus, for most humans, it makes little sense to devote themselves to starting a billion dollar company—the returns to such effort are too low. But most humans will walk across the street to pick up a billion dollar check.
Put more broadly: the power-seeking behavior humans display, when getting power is easy, seems to me quite compatible with the instrumental convergence thesis. And unchecked by ethics, constraints, and incentives (indeed, even when checked by these things) human power-seeking seems to me plenty dangerous, too. That said, the absence of various forms of overt power-seeking in humans may point to ways we could try to maintain control over less-than-fully PS-aligned APS systems (see 4.3 for more).
That said, I’m happy to acknowledge that the discussion of instrumental convergence in the power-seeking report is one of the weakest parts, on this and other grounds (see footnote for more);[1] that indeed numerous people over the years, including the ones you cite, have pushed back on issues in the vicinity (see e.g. Garfinkel’s 2021 review for another example; also Crawford (2023)); and that this pushback (along with other discussions and pieces of content—e.g., Redwood Research’s work on “control,” Carl Shulman on the Dwarkesh Podcast) has further clarified for me the importance of this aspect of the picture. I’ve added some citations in this respect. And I am definitely excited about people (external academics or otherwise) criticizing/refining these arguments—that’s part of why I write these long reports trying to be clear about the state of the arguments as I currently understand them.
The way I’d personally phrase the weakness is: the formulation of instrumental convergence focuses on arguing from “misaligned behavior from an APS system on some inputs” to a default expectation of “misaligned power-seeking from an APS system on some inputs.” I still think this is a reasonable claim, but per the argument in this post (and also per my response to Thorstad here), in order to get to an argument for misaligned power-seeking on the inputs the AI will actually receive, you do need to engage in a much more holistic evaluation of the difficulty of controlling an AI’s objectives, capabilities, and circumstances enough to prevent problematic power-seeking from being the rational option. Section 4.3 in the report (“The challenge of practical PS-alignment”) is my attempt at this, but I think I should’ve been more explicit about its relationship to the weaker instrumental convergence claim outlined in 4.2, and it’s more of a catalog of challenges than a direct argument for expecting PS-misalignment. And indeed, my current view is that this is roughly the actual argumentative situation. That is, for AIs that aren’t powerful enough to satisfy the “very easy to take over via a wide variety of methods” condition discussed in the post, I don’t currently think there’s a very clean argument for expecting problematic power-seeking—rather, there is mostly a catalog of challenges that lead to increasing amounts of concern the easier takeover becomes. Once you reach systems that are in a position to take over very easily via a wide variety of methods, though, something closer to the recast classic argument in the post starts to apply (and in fairness, both Bostrom and Yudkowsky, at least, do tend to try to also motivate expecting superintelligences to be capable of this type of takeover—hence the emphasis on decisive strategic advantages).
Retracted. I apologize for mischaracterizing the report and for the unfair attack on your work.