Being very intelligent, the LLM understands that the humans will interfere with the ability to continue running the paperclip-making machines, and advises a strategy to stop them from doing so. The agent follows the LLM’s advice, as it learnt to do in training, and therefore begins to display power-seeking behaviour.
I found this interesting so just leaving some thoughts here:
- The agent has learnt an instrumentally convergent goal: ask LLM when uncertain.
-The LLM is exhibiting power-seeking behaviour, due to one (or more) of the below:
Goal misspecification: instead of being helpful and harmless to humans, it is trained to be (as every LLM today) helpful to everything that uses it.
being pre-trained on text consistent with power-seeking behaviour.
having learnt power-seeking during fine-tuning.
I think “1. ” is the key dynamic at play here. Without this, the extra agent is not required at all. This may be a fundamental problem with how we are training LLMs. If we want LLMs to be especially subservient to humans, we might want to change this. I don’t see any easy way of accomplishing this without introducing many more failure modes though.
For “2.” I expect LLMs to always be prone to failure modes consistent with the text they are pre-trained on. The set of inputs which can instantiate an LLM in one of these failure modes should diminish with (adversarial) fine-tuning. (In some sense, this is also a misspecification problem: Imitation is the wrong goal to be training the AI system with if we want an AI system that is helpful and harmless to humans.)
I think “3. ” is what most people focus on when they are asking “Why would a model power-seek upon deployment if it never had the opportunity to do so during training?”. Especially since almost all discourse in power-seeking has been in the RL context.
Regarding 3, yeah, I definitely don’t want to say that the LLM in the thought experiment is itself power-seeking. Telling someone how to power-seek is not power seeking.
Regarding 1 and 2, I agree that the problem here is producing an LLM that refuses to give dangerous advice to another agent. I’m pretty skeptical that this can be done in a way that scales, but this could very well be lack of imagination on my part.
I found this interesting so just leaving some thoughts here:
- The agent has learnt an instrumentally convergent goal: ask LLM when uncertain.
-The LLM is exhibiting power-seeking behaviour, due to one (or more) of the below:
Goal misspecification: instead of being helpful and harmless to humans, it is trained to be (as every LLM today) helpful to everything that uses it.
being pre-trained on text consistent with power-seeking behaviour.
having learnt power-seeking during fine-tuning.
I think “1. ” is the key dynamic at play here. Without this, the extra agent is not required at all. This may be a fundamental problem with how we are training LLMs. If we want LLMs to be especially subservient to humans, we might want to change this. I don’t see any easy way of accomplishing this without introducing many more failure modes though.
For “2.” I expect LLMs to always be prone to failure modes consistent with the text they are pre-trained on. The set of inputs which can instantiate an LLM in one of these failure modes should diminish with (adversarial) fine-tuning. (In some sense, this is also a misspecification problem: Imitation is the wrong goal to be training the AI system with if we want an AI system that is helpful and harmless to humans.)
I think “3. ” is what most people focus on when they are asking “Why would a model power-seek upon deployment if it never had the opportunity to do so during training?”. Especially since almost all discourse in power-seeking has been in the RL context.
Regarding 3, yeah, I definitely don’t want to say that the LLM in the thought experiment is itself power-seeking. Telling someone how to power-seek is not power seeking.
Regarding 1 and 2, I agree that the problem here is producing an LLM that refuses to give dangerous advice to another agent. I’m pretty skeptical that this can be done in a way that scales, but this could very well be lack of imagination on my part.