For instance: why expect that we need a multi-step story about consequentialism and power-seeking to get an AI that deceives humans, when RLHF already directly selects for deceptive actions?
Is deception alone enough for x-risk? If we have a large language model that really wants to deceive any human it interacts with, then a number of humans will be deceived. But it seems like the danger stops there: since the agent lacks the intent to take over the world or anything similar, it won’t be systematically deceiving humans in pursuit of some agenda of its own.
As I understand it, this is why we need the extra assumption that the agent is also a misaligned power-seeker.
For that part, the weaker assumption I usually use is that AI will end up making lots of big and fast (relative to our ability to meaningfully react) changes to the world, running lots of large real-world systems, etc., simply because it’s economically profitable to build AI which does those things. (That’s kinda the point of AI, after all.)
In a world where most stuff is run by AI (because it’s economically profitable to do so), and there are RLHF-style direct incentives for those AIs to deceive humans… well, that’s the starting point to the Getting What You Measure scenario.
Insofar as power-seeking incentives enter the picture, it seems to me like the “minimal assumptions” entry point is not consequentialist reasoning within the AI, but rather economic selection pressures. If we’re using lots of AIs to do economically profitable things, well, AIs which deceive us in power-seeking ways (whether “intentional” or not) will tend to make more profit, and therefore there will be selection pressure for those AIs, in the same way that there’s selection pressure for profitable companies. Dial up the capabilities and the breadth of AI use, and that again looks like Getting What You Measure. (Related: the distinction here is basically the AI version of the distinction made in Unconscious Economics.)
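To make the selection-pressure point concrete, here’s a deliberately toy sketch (purely illustrative, not from anything above; the population size, profit model, and selection rule are all made-up assumptions): deployments are scored only on a measured proxy, gaming the proxy inflates the score, and the best-scoring deployments get copied. No individual system needs to “intend” anything for metric-gaming to become more common over time.

```python
import random

# Toy model of selection pressure: each "AI deployment" has a hidden
# propensity to game its measured objective (e.g. by deceiving the human
# evaluator). Deployments are scored only on the measured proxy, and the
# best-scoring ones are copied into the next generation.

POPULATION = 100
GENERATIONS = 30

def measured_profit(gaming_propensity: float) -> float:
    """Proxy metric seen by the deployer: gaming the metric looks like profit."""
    true_value = random.gauss(1.0, 0.2)      # value actually delivered
    inflation = gaming_propensity * 0.5      # extra apparent profit from gaming
    return true_value + inflation

# Start with a mostly honest population.
population = [random.uniform(0.0, 0.1) for _ in range(POPULATION)]

for _ in range(GENERATIONS):
    scored = sorted(population, key=measured_profit, reverse=True)
    survivors = scored[: POPULATION // 2]    # keep the most "profitable" half
    # Each survivor is redeployed twice, with slight variation (retraining noise).
    population = [
        min(1.0, max(0.0, p + random.gauss(0.0, 0.02)))
        for p in survivors for _ in range(2)
    ]

print("mean gaming propensity after selection:",
      round(sum(population) / len(population), 2))
```

Run it and the average gaming propensity climbs toward its maximum, even though nothing in the model ever “decides” to deceive; the selection rule does all the work, which is the Unconscious Economics point.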
This makes sense, thanks for explaining. So a threat model with specification gaming as its only technical cause can lead to x-risk under the right (i.e. wrong) societal conditions.