I read most of this paper, albeit somewhat quickly, and I skipped a few sections. I appreciate how clear the writing is, and I want to encourage more AI risk proponents to write papers like this to explain their views. That said, I largely disagree with the conclusion and with several lines of reasoning within it.
Here are some of my thoughts (although these are not my only disagreements):
I think the definition of “disempowerment” is vague in a way that fails to distinguish between e.g. (1) “less than 1% of world income goes to humans, but they have a high absolute standard of living and are generally treated well” vs. (2) “humans are in a state of perpetual impoverishment and oppression due to AIs and generally the future sucks for them”.
These are distinct scenarios with very different implications (under my values) for whether what happened is good or bad.
I think (1) is OK and more-or-less the default outcome from AI, whereas (2) would be a lot worse, and I also find it less likely.
By not distinguishing between these scenarios, the paper allows for a motte-and-bailey in which the authors show that some (generic) range of outcomes could occur and then imply that it is bad, even though both good and bad scenarios are consistent with the set of outcomes they’ve demonstrated.
I think this quote is pretty confused and seems to rely partially on a misunderstanding of what people mean when they say that AGI cognition might be messy: “Second, even if human psychology is messy, this does not mean that an AGI’s psychology would be messy. It seems like current deep learning methodology embodies a distinction between final and instrumental goals. For instance, in standard versions of reinforcement learning, the model learns to optimize an externally specified reward function as best as possible. It seems like this reward function determines the model’s final goal. During training, the model learns to seek out things which are instrumentally relevant to this final goal. Hence, there appears to be a strict distinction between the final goal (specified by the reward function) and instrumental goals.”
Generally speaking, reinforcement learning shouldn’t be seen as directly encoding goals into models and thereby making them agentic, but should instead be seen as a process used to select models for how well they get reward during training.
Consequently, there’s no strong reason why reinforcement learning should create entities with a clean psychological goal structure that is sharply different from, and less messy than, human goal structures (cf. “Models don’t ‘get reward’”, and the sketch below).
But I agree that future AIs could be agentic if we deliberately design them to be agentic, including via extensive reinforcement learning.
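To illustrate what I mean by “selection” here, below is a minimal REINFORCE-style sketch on a toy two-armed bandit. Everything in it (the bandit environment, names like ARM_REWARD_PROBS and theta) is my own illustrative assumption, not anything from the paper; the point is just that reward only appears inside the training loop, as a signal that shifts the policy’s parameters, and the deployed policy has no reward signal left to “pursue”.

```python
# Minimal REINFORCE-style sketch on a toy two-armed bandit (numpy only).
# Illustrative assumption, not from the paper: reward is used only inside
# the training loop, as a signal that nudges the policy's parameters.
import numpy as np

rng = np.random.default_rng(0)
ARM_REWARD_PROBS = [0.2, 0.8]   # environment detail; the policy never sees this
theta = np.zeros(2)             # policy logits over the two arms

def policy(logits):
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Training: reward appears here, and only here, to update theta.
for _ in range(2000):
    probs = policy(theta)
    arm = rng.choice(2, p=probs)
    reward = float(rng.random() < ARM_REWARD_PROBS[arm])
    grad_log_prob = -probs
    grad_log_prob[arm] += 1.0   # gradient of log pi(arm) w.r.t. the logits
    theta += 0.1 * reward * grad_log_prob

# Deployment: the trained parameters just map (trivial) state to actions.
# There is no reward signal at this point for the model to "receive" or pursue.
print("action probabilities after training:", policy(theta))
```

Whether a model trained this way ends up with an internal “final goal” that mirrors the reward function is a further empirical question about what the selected parameters actually compute, which is the point of the “Models don’t ‘get reward’” framing.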
I think this quote potentially indicates a flawed mental model of AI development underneath: “Moreover, I want to note that instrumental convergence is not the only route to AI capable of disempowering humanity which tries to disempower humanity. If sufficiently many actors will be able to build AI capable of disempowering humanity, including, e.g. small groups of ordinary citizens, then some will intentionally unleash AI trying to disempower humanity.”
I think this type of scenario is very implausible because AIs will very likely be developed by large entities with lots of resources (such as big corporations and governments) rather than e.g. small groups of ordinary citizens.
By the time small groups of less powerful citizens have the power to develop very smart AIs, we will likely already be in a world filled with very smart AIs. In this case, either human disempowerment already happened, or we’re in a world in which it’s much harder to disempower humans, because there are lots of AIs who have an active stake in ensuring this does not occur.
The last point is very important, and follows from a more general principle that the “ability necessary to take over the world” is not constant, but instead increases with the technology level. For example, if you invent a gun, that does not make you very powerful, because other people could have guns too. Likewise, simply being very smart does not give you overwhelming hard power over the rest of the world if the rest of the world is filled with very smart agents.
I think this quote overstates the value specification problem and ignores evidence from LLMs that this kind of goal specification is not very hard: “There are two kinds of challenges in aligning AI. First, one needs to specify the goals the model should pursue. Second, one needs to ensure that the model robustly pursues those goals. The first challenge has been termed the ‘king Midas problem’ (Russell 2019). In a nutshell, human goals are complex, multi-faceted, diverse, wide-ranging, and potentially inconsistent. This is why it is exceedingly hard, if not impossible, to explicitly specify everything humans tend to care about.”
I don’t think we need to “explicitly specify everything humans tend to care about” into a utility function. Instead, we can have AIs learn human values by having them trained on human data.
This is already what current LLMs do. If you ask GPT-4 to execute a sequence of instructions, it rarely misinterprets you in a way that would imply improper goal specification. The more likely outcome is that GPT-4 will simply not be able to fulfill your request, not that it will execute a mis-specified sequence of instructions that satisfies the literal specification of what you said at the expense of what you intended.
Note that I’m not saying that GPT-4 merely understands what you’re requesting. I am saying that GPT-4 generally executes your instructions as you intended them (an action, not a belief).
I think the argument about how instrumental convergence implies disempowerment proves too much. Lots of agents in the world don’t try to take over the world despite having goals that are not identical to the goals of other agents. If your claim is that powerful agents will naturally try to take over the world unless they are exactly aligned with the goals of the rest of the world, then I don’t think this claim is consistent with the existence of powerful sub-groups of humanity (e.g. large countries) that do not try to take over the world despite being very powerful.
You might reason, “Powerful sub-groups of humans are aligned with each other, which is why they don’t try to take over the world”. But I dispute this hypothesis:
First of all, I don’t think that humans are exactly aligned with the goals of other humans. I think that’s just empirically false in almost every way you could measure the truth of the claim. At best, humans are generally partially (not totally) aligned with random strangers—which could also easily be true of future AIs that are pretrained on our data.
Second of all, I think the most common view in social science is that powerful groups don’t constantly go to war and predate on smaller groups because there are large costs to war, rather than because of moral constraints. Attempting takeover is generally risky and not usually better in expectation than trading, negotiating, compromising, and accumulating resources lawfully (e.g. a violent world takeover would involve a lot of pointless destruction of resources). This is distinct from the idea that human groups don’t try to take over the world because they’re aligned with human values (which I also think is too vague to evaluate meaningfully, if that’s what you’d claim).
You can’t easily counter by saying “no human group has the ability to take over the world”, because it is trivial to carve out subsets of humanity that control >99% of wealth and resources, and which could in principle take control of the entire world if they became unified and decided to achieve that goal. These arbitrary subsets of humanity don’t attempt world takeover largely because they are not coordinated as a group, but AIs could similarly fail to be unified and coordinated around such a goal.