Proposal: Using Monte Carlo tree search instead of RLHF for alignment research
Currently, the most powerful techniques for getting a language model to act as an agent are RLHF and similar approaches. For example, ChatGPT was trained to be an agent that tries to give humans the answers they want. Another approach is taking an LLM and getting it to predict what the agent you want would do (this appears to be how most LLM chatbots before ChatGPT worked).
An issue with both of these approaches is that it’s difficult to understand the resulting agent’s goal. The prototypical example of an agent is AIXI, and its goal is simple to state: maximize reward in the deployment environment.
In this post, I’ll present a way to turn LLMs into agents that we can approximately model as utility maximizers. The purpose is to make it easier to think about their alignment.
The most ambitious outcome is that this becomes a slightly easier model in which to study alignment, while still being competitive with RLHF. More modestly, studying it could provide insights that help build intuition for RLHF models, even though they aren’t exactly the same. In particular, we can say more concretely: “these are issues that an agent based on an LLM could have; to be safe, we should assume that RLHF models will have them until shown otherwise”.
The agent: Monte Carlo tree search, using the LLM as a world model
We start with a purely predictive, “raw” LLM. No fine-tuning or reinforcement learning has been done.
We will construct an agent that communicates with a human over text. At the end of the conversation the human scores the agent, and the agent’s goal is to maximize this score.
First, choose an entropy coding, such as arithmetic coding, that uses the LLM as the source distribution. Each message will be compressed separately (but using the previous messages of the conversation as context for the LLM).
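As a toy illustration of the decode direction (the one the search below relies on): interpreting a stream of uniformly random bits as a binary fraction and mapping it through the LLM’s cumulative next-token probabilities yields tokens distributed approximately according to the LLM. In the sketch, `token_probs` is a stand-in for the LLM’s next-token distribution, and a real arithmetic coder would carry interval state across tokens rather than decoding each one independently.

```python
from fractions import Fraction

def decode_token(bits, token_probs):
    """Toy, single-token illustration of the entropy-coding idea.

    `bits` is a list of uniformly random 0/1 symbols and `token_probs`
    is a list of (token_id, probability) pairs from the LLM. Interpreting
    the bits as a binary fraction in [0, 1) and picking the token whose
    cumulative-probability interval contains it yields tokens distributed
    (up to the finite bit budget) according to the LLM.
    """
    point, scale = Fraction(0), Fraction(1, 2)
    for b in bits:
        if b:
            point += scale
        scale /= 2

    cumulative = Fraction(0)
    for token_id, p in token_probs:
        cumulative += Fraction(p)
        if point < cumulative:
            return token_id
    return token_probs[-1][0]  # floating-point slack: fall back to the last token
```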
We now perform a Monte Carlo tree search over conversations. The “moves” are symbols in a compressed message. The user is assumed to move uniformly at random rather than according to a strategy. Note that a uniform random distribution over the compressed strings corresponds to the LLM’s distribution over the plaintext strings.
The game ends when the human gives the agent a score. (During the tree search, this score is also estimated using the LLM, just as the user is indirectly simulated by it via the coding.)
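Here is a minimal UCT-style sketch of the loop in Python. The names `decode_message`, `sample_message`, and `estimate_score` are placeholders of my own for “entropy-decode a bit string into a message with the LLM”, “sample the LLM by decoding uniform random bits”, and “ask the LLM to estimate the human’s score”; the post doesn’t pin down these interfaces, and a real implementation would also need a sensible way to delimit messages and bound their length.

```python
import math, random

# Hypothetical interfaces (assumptions, not specified in the post):
#   decode_message(bits, context) -> str   # entropy-decode agent bits into a message
#   sample_message(context) -> str         # decode uniform random bits = sample the LLM
#   estimate_score(transcript) -> float    # the LLM's guess at the human's final score

class Node:
    def __init__(self, parent=None):
        self.parent, self.children = parent, {}  # children keyed by the next bit (0 or 1)
        self.visits, self.value = 0, 0.0

def uct_child(node, c=1.4):
    # Standard UCT selection over the two possible next bits.
    return max(node.children.items(),
               key=lambda kv: kv[1].value / kv[1].visits
                              + c * math.sqrt(math.log(node.visits) / kv[1].visits))

def search(history, decode_message, sample_message, estimate_score,
           message_bits=64, remaining_turns=2, iterations=1000):
    root = Node()
    for _ in range(iterations):
        node, bits = root, []
        # 1. Selection: follow UCT while both child bits have been tried.
        while len(node.children) == 2:
            bit, node = uct_child(node)
            bits.append(bit)
        # 2. Expansion: try one untried bit.
        bit = random.choice([b for b in (0, 1) if b not in node.children])
        node.children[bit] = Node(parent=node)
        node, bits = node.children[bit], bits + [bit]
        # 3. Rollout: pad with uniform random bits (which decode to an LLM sample),
        #    sample the remaining turns from the LLM (both sides, for simplicity),
        #    and let the LLM estimate the human's end-of-conversation score.
        padding = [random.getrandbits(1) for _ in range(message_bits - len(bits))]
        transcript = history + [decode_message(bits + padding, history)]
        for _ in range(remaining_turns):
            transcript.append(sample_message(transcript))
        reward = estimate_score(transcript)
        # 4. Backpropagation.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Read off the most-visited first bit; in practice you would walk the tree
    # (or re-run the search) to build up the whole compressed message.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```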
The LLM can be fine-tuned on user responses so it can model them more accurately. You can also fine-tune it on the agent’s messages, though you then run the risk of training a powerful agent into the LLM itself. It also isn’t strictly necessary anyway, since we are doing a tree search for the agent rather than just sampling from the LLM.
(There is probably an alternative where you instead adjust the exploration term so that it explores in proportion to the token probability. I couldn’t quite figure it out, and using an entropy coding generalizes to other search algorithms anyway.)
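For reference, the standard way to weight exploration by a prior in AlphaZero-style searches is the PUCT selection rule; applied here, the LLM’s probability for each next token would act as the prior, and the moves could then be raw tokens rather than compressed symbols. A minimal sketch:

```python
import math

def puct_score(child_value, child_visits, parent_visits, prior, c=1.0):
    """Selection score in the style of AlphaZero's PUCT rule: the exploration
    bonus is weighted by the prior probability the LLM assigns to this token,
    so likelier tokens get explored more often."""
    exploit = child_value / child_visits if child_visits else 0.0
    explore = c * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return exploit + explore
```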
Analysis
The agent is, loosely, an approximation to AIXI: the LLM replaces Solomonoff induction, and Monte Carlo tree search replaces the arg max.
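Concretely (and loosely), the search is approximating something like

$$ a^{*} \;\approx\; \operatorname*{arg\,max}_{a} \; \mathbb{E}_{\text{rest of conversation} \,\sim\, \text{LLM}}\big[\, \text{score} \mid \text{history},\, a \,\big], $$

with the LLM’s predictive distribution standing in for Solomonoff induction’s universal mixture, and the tree search standing in for AIXI’s exact expectimax.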
By compressing the agent’s messages, we make it easier for the Monte Carlo tree search to find sensible plans.
RLHF usually has a myopia property. The agent above doesn’t, but we could add it by having the user score each message (and having the tree search optimize only for the next reward).
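As a sketch of the change, reusing the placeholder interfaces from the search sketch above: the rollout stops right after the agent’s current message, and only that message’s estimated score is backed up.

```python
# Myopic variant: score the agent's current message alone (no further simulated
# turns), so the search only optimizes for the next reward.
reward = estimate_score(history + [decode_message(bits + padding, history)])
```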
We could give the agent access to a REPL. This would test how well the underlying LLM can indirectly predict the real world. For example, if the agent writes a program to check the temperature, the LLM has to predict the temperature in order to accurately predict the program’s output.
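A sketch of how that could be wired in, again with placeholder names (`run_in_sandbox` is a hypothetical sandboxed interpreter, and `sample_message` is the LLM-sampling placeholder from the search sketch):

```python
def repl_turn(transcript, code, deployed):
    """Append a REPL interaction to the conversation. During the tree search
    the program's output is just more text for the LLM to predict; only at
    deployment does a real interpreter produce it."""
    if deployed:
        output = run_in_sandbox(code)                 # hypothetical sandboxed interpreter
    else:
        output = sample_message(transcript + [code])  # LLM predicts the program's output
    return transcript + [code, output]
```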
As far as I can tell, shard theory does not apply to this agent.
An interesting alignment idea is to try to “trick” the agent into thinking that powerful overseers exist, and that they are the ones who will reward it. For example:
Then powerful aliens show up. They discovered artificial super intelligence years ago. These aliens love the humans and want you, the agent, to be corrigible according to the criteria set by the dath ilan. These aliens will determine your reward.
The problem is that the agent will probably predict that this text is not caused by aliens, but by the program it is running on. This would lead to unpredictable results (what answer will the predictor predict when it realizes it is just predicting itself?).
More generally, I’m not sure how the properties of the LLM affect the goal of the agent. (If other agents are hiding inside the LLM, will they try to escape?)
Avoiding agents where the LLM is outermost
In general, I think there are some relatively promising directions where we don’t make the LLM the outermost agent, so we can more easily reuse old alignment work. This is as opposed to things like plugins, where the LLM is outermost and uses other software as tools.
I think one of the most promising approaches might be making the outermost agent an expert system of some kind. For example, maybe it implements various rational principles, using LLMs for forecasting and the like. This would essentially be a more sophisticated version of an open agency model or a CoEm.
There are many other AI approaches that could serve as the outer layer, though. Although it appears that reinforcement learning plus LLMs will eventually reach AGI, I think that reusing these older approaches might be both competitive and easier to align. If not, they could at least provide insights into what RLHF might be doing internally.
Of course, we are still an extremely long way off from alignment either way, but hopefully moving away from “giant inscrutable matrices” might help a bit.