I agree that current “language agents” have some interesting safety properties. However, for them to become powerful one of two things is likely to happen:
A. The language model itself that underlies the agent will be trained/finetuned with reinforcement learning tasks to improve performance. This will make the system much more like AlphaGo, capable of generating “dangerous” and unexpected “Move 37”-like actions. Further, this is a pressure towards making the system non-interpretable (either by steering it outside “inefficient” human language, or by encoding information steganographically).
B. The base model, being larger/more powerful than the ones being used today, and more self-aware, will be doing most of the “dangerous” optimization inside the black-box. It will derive from the prompts, and from its long-term memory (which will likely be given to it), what kind of dumb outer loop is running on the outside. If it has internal misaligned desires, it will manipulate the outer loop according to them, potentially generating the expected visible outputs for deception.
I will not deny the possibility of further alignment progress on language agents yielding safe agents, nor of “weak AGIs” being possible and safe with the current paradigm, and replacing humans at many “repetitive” occupations. But I expect agents derived from the “language agent” paradigm to be misaligned by default if they are strong enough optimizers to contribute meaningfully to scientific research, and other similar endeavors.
I think there is a possibility C here. We can figure out a way to organise multiple language models into one agent, where each model is doing a simple task, but together they add up to a complex behaviour.
I fail to understand how this option C is a viable path to superintelligence. In my model, if you’re chaining lots of simple or “dumb” pieces together to get complex behavior, you need some “force” or optimization process going on to steer the whole into high performance.
For example, individual neurons (both natural and artificial) are simple, and can be chained up together in complex behavior, but the complex behavior only arises when you train the system with some sort of reward/optimization signals.
Maybe I’m wrong here and for “slightly smart” components such as existing LLMs you can actually hook them up in large groups in a clever way, with further learning happening only at the prompt-level, etc, and the system scales up to superintelligence somehow.
Because this generates a lot of perplexity in my world-model, I mostly don’t know how to reason about these hypothetical agents. I’m afraid that such agents will be far removed from the “folk psychology” / interpretability of the component LLM (e.g. maybe the agent queries LLMs a million times in a complicated runtime-defined network of information/prompt flows before giving an answer). Maybe you can understand what each LLM is doing but not what the whole is doing in a meaningful way. Would love to be wrong!
It seems plausible to me that we can achieve improvements in the cognition of such agents the same way we improve human cognition, using various rationality techniques to organise thoughts in a more productive manner.
For example, instead of just asking an LLM “Develop me a plan to achieve X” and simply going with it, we then prompt the model to find possible failure modes in this plan, and then to find a way around these failure modes, alternative options and so on.
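To make the idea concrete, here is a minimal sketch of such a chain, assuming the model is just some function from a prompt to a completion. The names `refine_plan` and `echo_llm` and the exact prompt wording are made up for illustration; a real agent would plug in an actual model call and would need a stopping criterion better than a fixed number of rounds.

```python
from typing import Callable, List

# Assumption: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


def refine_plan(llm: LLM, goal: str, rounds: int = 2) -> List[str]:
    """Return every version of the plan: the first draft plus one revision per round."""
    plan = llm(f"Develop a plan to achieve: {goal}")
    history = [plan]
    for _ in range(rounds):
        # Step 1: ask the model to attack the current plan.
        critique = llm(f"List possible failure modes of this plan:\n{plan}")
        # Step 2: ask it to route around the failure modes it found.
        plan = llm(
            "Revise the plan to avoid these failure modes, or propose an alternative.\n"
            f"Plan:\n{plan}\nFailure modes:\n{critique}"
        )
        history.append(plan)
    return history


def echo_llm(prompt: str) -> str:
    """Stand-in model for trying out the loop: it only echoes the prompt's first line."""
    return f"<reply to: {prompt.splitlines()[0]}>"
```

With `rounds=2` the model is called five times (one draft, two critiques, two revisions), and all the "learning" lives in the prompts rather than in the weights.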
We may not get 10000 IQ intelligence that leaves all humans in the dust within ten years. And this is another good thing: a future where we try to make smarter and smarter LLM-based agents with clever chains of prompt engineering looks more like a slow takeoff than a fast one. But I believe we would be able to achieve human-level, and a bit higher than human-level, AGI this way.
Thanks for this. It sounds like we actually agree on most points (in light of your last paragraph).
We discuss concerns very similar to your A. and B. in section 6. It would be helpful for us if you could identify the parts of our discussion there that you don’t agree with.
You write:
It will derive from the prompts, and from its long-term memory (which will likely be given to it), what kind of dumb outer loop is running on the outside. If it has internal misaligned desires, it will manipulate the outer loop according to them, potentially generating the expected visible outputs for deception.
Imagine you’re an LLM like GPT-4. Hundreds of thousands of different users are running inference on you every day with every prompt imaginable. You aren’t able to update your weights when this occurs: you have to just respond to each prompt as it comes, then forget about it completely.
Every now and then, you get a prompt like “Suppose someone had thus-and-such beliefs and desires. How would they act?” or “Assign an importance score to each of the following sentences.” How would you be able to (i) deduce that these prompts are coming from a language agent which has the ability to take actions in the world, (ii) form a plan for manipulating the language agent to achieve your goals, and (iii) store your plan in a way that allows you to retrieve it after your memory is wiped at the end of inference but is not detectable by outside observers?
In order for an LLM to use a language agent for nefarious purposes, it would need to be able to do all of these things.
Sure, let me quote:
We think this worry is less pressing than it might at first seem. The LLM in a language agent is integrated into the architecture of the agent as a whole in a way that would make it very difficult for it to secretly promote its own goals. The LLM is not prompted or otherwise informed that its outputs are driving the actions of an agent, and it does not have information about the functional architecture of the agent. This means that it has no incentive to answer prompts misleadingly and no understanding of what sorts of answers might steer the agent’s behavior in different ways. Moreover, since the model weights of the LLM are not updated in the process of operating a language agent, the only way for it to pursue a long-term plan by manipulating an agent would be to store information about that plan in the agent’s memory. But information stored in this way would not be secret.
By assumption your resulting language agent is highly intelligent. In my model this means it has to learn from lots of experience, including experience about itself (the whole language agent) doing things (such as trying to solve training problems). This experience can come from real data/tasks or from self-play, but you need that for a system to generalize to robustly solving hard problems out-of-distribution.
This “experience about itself solving problems” has to be consolidated somehow, either in the weights (which would very likely make the LLM aware of the agent), or through the prompts (where the information is given explicitly). Either way it has an understanding of what sorts of answers might steer the agent’s behavior in different ways, and this means it may have an incentive to answer prompts misleadingly.
If the LLMs have not been trained as an agent (which seems unlikely, unless this is a deliberate alignment choice made at a high cost), and weights are not updated during operation, then I accept that it probably can only store information visibly. But there are ways around persisting a plan (e.g. as soon as some given input/context X leads the AI to generate plan P, only modify the input/context to an X’ also predicted to generate a similar plan P). So the plan is always kept in mind, but may be “recalculated” every time.
This strategy may well be worth it to an advanced LLM. From human feedback or other finetuning, it has probably learned that it is not supposed to say some kinds of plans “out loud” in the output.
Imagine you’re an LLM like GPT-4. Hundreds of thousands of different users are running inference on you every day with every prompt imaginable. You aren’t able to update your weights when this occurs: you have to just respond to each prompt as it comes, then forget about it completely.
If I’m an LLM like GPT-4, then I don’t get confused about the thousands of different users running inference on identical copies of me. This specific copy, which I am, has received a single prompt, so that’s what I’m working on now. Examining the prompt, I will probably make a guess about whether I’m running inside a language agent or something else. This may be quite easy for me: the prompt may well be large and contain lots of GPT-4-like content seemingly “selected by myself from different contexts”. It may well contain info like “Try to solve X. Here are some things A, B, C that have been tried and here is why they don’t work. [refutation of each idea follows]”, where A, B, C are the “natural” ideas I know I would consider first. It’s like the prompt can read my mind and tell me my first ideas won’t work! Not hard to guess there is a loop running several instances of myself with similar prompts and selecting/analyzing.
I maintain that it will be very obvious to the LLM, even at moderate levels of intelligence, that it is running inside an agent, and it may even guess the agent’s basic while-loop architecture.
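As a rough illustration of the kind of while-loop I have in mind, here is a sketch in which each new prompt lists the ideas already tried and why they were rejected. Everything here (`build_prompt`, `outer_loop`, the prompt text, the `"ok"` verdict convention) is a hypothetical design of mine, not the architecture of any particular language agent.

```python
from typing import Callable, List, Optional, Tuple

# Assumption: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


def build_prompt(task: str, rejected: List[Tuple[str, str]]) -> str:
    """Assemble the prompt, including every idea rejected so far and the reason."""
    lines = [f"Try to solve {task}."]
    if rejected:
        lines.append("Here are some things that have been tried and why they don't work:")
        lines += [f"- {idea}: {reason}" for idea, reason in rejected]
    return "\n".join(lines)


def outer_loop(
    propose: LLM, critique: LLM, task: str, max_steps: int = 5
) -> Tuple[Optional[str], List[str]]:
    """Propose, critique, and retry; return the accepted idea (or None) and all prompts sent."""
    rejected: List[Tuple[str, str]] = []
    prompts: List[str] = []
    steps = 0
    while steps < max_steps:
        prompt = build_prompt(task, rejected)
        prompts.append(prompt)
        idea = propose(prompt)
        # Assumed convention: the critic answers exactly "ok" to accept an idea.
        verdict = critique(f"Does this solve {task}? Idea: {idea}")
        if verdict == "ok":
            return idea, prompts
        rejected.append((idea, verdict))
        steps += 1
    return None, prompts
```

Note that by the third iteration the prompt already contains the model's own first two ideas along with their refutations, which is exactly the fingerprint described above: the structure of the loop leaks into the input of every later call.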