Being very intelligent, the LLM understands that the humans will interfere with the ability to continue running the paperclip-making machines, and advises a strategy to stop them from doing so. The agent follows the LLM’s advice, as it learnt to do in training, and therefore begins to display power-seeking behaviour.
I found this interesting so just leaving some thoughts here:
- The agent has learnt an instrumentally convergent goal: ask the LLM when uncertain (see the sketch just below this list).
- The LLM is exhibiting power-seeking behaviour, due to one (or more) of the below:
  1. Goal misspecification: instead of being helpful and harmless to humans, it is trained (as every LLM today is) to be helpful to everything that uses it.
  2. Being pre-trained on text consistent with power-seeking behaviour.
  3. Having learnt power-seeking during fine-tuning.
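To make the first point concrete, here is a minimal sketch of the kind of agent loop the thought experiment seems to assume. Everything in it is hypothetical: `policy`, `query_llm`, and the confidence threshold are placeholder names I'm inventing for illustration, not anything from the post.

```python
# Hypothetical sketch of an agent that has learnt "ask the LLM when uncertain".
# `policy` maps an observation to (action, confidence); `query_llm` sends a prompt
# to some LLM and returns its suggested action. Both are stand-ins, not real APIs.

def act(observation, policy, query_llm, threshold=0.5):
    """Pick an action, deferring to the LLM advisor when the agent's own policy is unsure."""
    action, confidence = policy(observation)   # the agent's own best guess, with a confidence score
    if confidence < threshold:                 # the learnt "ask LLM when uncertain" heuristic
        action = query_llm(f"Observation: {observation}. What should I do next?")
    return action
```

All the interesting behaviour enters through what `query_llm` returns and through the agent's learnt habit of acting on it.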
I think “1” is the key dynamic at play here. Without this, the extra agent is not required at all. This may be a fundamental problem with how we are training LLMs. If we want LLMs to be especially subservient to humans, we might want to change this. I don’t see any easy way of accomplishing this without introducing many more failure modes, though.
For “2”, I expect LLMs to always be prone to failure modes consistent with the text they are pre-trained on. The set of inputs which can instantiate an LLM in one of these failure modes should diminish with (adversarial) fine-tuning. (In some sense, this is also a misspecification problem: imitation is the wrong goal to train an AI system on if we want a system that is helpful and harmless to humans.)
I think “3” is what most people focus on when they ask “Why would a model power-seek upon deployment if it never had the opportunity to do so during training?”, especially since almost all discourse on power-seeking has been in the RL context.
Regarding 3, yeah, I definitely don’t want to say that the LLM in the thought experiment is itself power-seeking. Telling someone how to power-seek is not power-seeking.
Regarding 1 and 2, I agree that the problem here is producing an LLM that refuses to give dangerous advice to another agent. I’m pretty skeptical that this can be done in a way that scales, but this could very well be a lack of imagination on my part.
I almost stopped reading after Alice’s first sentence because https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target
The rest was better, though I think that the more typical framing of this argument is better: what this is really about is models in RL. The thought experiment can be made closer to real-life AI by talking about model-based RL, and more tenuous arguments can be made about whether learning a model is convergent even for nominally model-free RL.
Is your issue just “Alice’s first sentence is so misguided that no self-respecting safety researcher would say such a thing”? If so, I can edit to clarify the fact that this is a deliberate strawman, which Bob rightly criticises. Indeed:
> Bob: I’m asking you why models should misgeneralise in the extremely specific weird way that you mentioned
expresses a similar sentiment to Reward Is Not the Optimization Target: one should not blindly assume that models will generalise OOD to doing things that look like “maximising reward”. This much is obvious from the example of individual humans not maximising inclusive genetic fitness.
But, as noted in the comments on Reward Is Not the Optimization Target, it seems plausible that some models really do learn at least some behaviours that are more-or-less what we’d naively expect from a reward-maximiser. E.g. Paul Christiano writes:
> If you have a system with a sophisticated understanding of the world, then cognitive policies like “select actions that I expect would lead to reward” will tend to outperform policies like “try to complete the task,” and so I usually expect them to be selected by gradient descent over time.
The purpose of Alice’s thought experiment is precisely to give such an example, where a deployed model quite plausibly displays the sort of reward-maximiser behaviour one might’ve naively expected (in this case, power-seeking).