Think carefully before calling RL policies “agents”
I think agentic systems represent most of AI extinction risk. I want to think clearly about what training procedures produce agentic systems. Unfortunately, the field of reinforcement learning has a convention of calling its trained artifacts “agents.” This terminology is loaded and inappropriate for my purposes. I advocate instead calling the trained system a “policy.” This name is standard, accurate, and neutral.
Don’t assume the conclusion by calling a policy an “agent”
The real-world systems we want to think about and align are very large neural networks like GPT-4. These networks are trained and finetuned via different kinds of self-supervised and reinforcement learning.
When a policy network is updated using a learning process, its parameters are changed via weight updates. Eventually, the process ends (assuming no online learning for simplicity). We are then left with a policy network (e.g. GPT-4). To actually use the network, we need to use some sampling procedure on its logits (e.g. top-p with a given temperature). Once we fix the policy network and sampling procedure, we get a mapping from observations (e.g. sequences of embeddings, like those for [I, love, dogs]) to probability distributions over outputs (e.g. tokens). This mapping is the policy.
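The distinction between policy network, sampling procedure, and policy can be made concrete with a minimal sketch. Everything named here is an assumption for illustration: `toy_policy_network` is a hypothetical stand-in (a fixed random matrix indexed by the last token), not how GPT-4 or any real model computes logits, and the four-word vocabulary is made up. The only point is that fixing the network and the sampling procedure yields a mapping from observations to probability distributions.

```python
import numpy as np

def policy_distribution(logits, temperature=1.0, top_p=0.9):
    """Turn logits into the distribution that top-p sampling draws from."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max()          # subtract max for numerical stability
    probs = np.exp(scaled)
    probs = probs / probs.sum()             # temperature-scaled softmax

    order = np.argsort(-probs)              # token indices, most likely first
    cumulative = np.cumsum(probs[order])
    # Keep the smallest prefix of tokens whose total mass reaches top_p.
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]

    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()        # renormalize over the kept tokens

# Hypothetical toy "policy network": logits depend only on the last token.
VOCAB = ["I", "love", "dogs", "cats"]
rng = np.random.default_rng(0)
WEIGHTS = rng.normal(size=(len(VOCAB), len(VOCAB)))

def toy_policy_network(token_ids):
    """Map an observation (a sequence of token ids) to logits over VOCAB."""
    return WEIGHTS[token_ids[-1]]

def policy(token_ids, temperature=0.7, top_p=0.9):
    """The policy: fixed network plus fixed sampling procedure."""
    return policy_distribution(toy_policy_network(token_ids), temperature, top_p)

# Observation [I, love, dogs] -> a probability distribution over next tokens.
distribution = policy([0, 1, 2])
print(dict(zip(VOCAB, distribution.round(3))))
```

Changing `temperature` or `top_p` changes the policy even though the network's weights stay the same, which is why the policy and the policy network are worth keeping distinct.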
I want to carefully consider whether a trained policy will exhibit agentic cognition of various forms, including planning, goal-directedness, and situational awareness. While considering this question, we should not start calling the trained policy an “agent”! That’s like a detective randomly calling one of the suspects “criminal.” I prefer just calling the trained artifact a “policy.” This neutrally describes the artifact’s function, without connoting agentic or dangerous cognition.
Of course, a policy could in fact be computed using internal planning (e.g. depth-3 heuristic search) to achieve an internally represented goal (e.g. number of diamonds predicted to be present). I think it’s appropriate to call that kind of computation “agentic.” But that designation is only appropriate after further information is discovered (e.g. how the policy in fact works).
There’s no deep reason why trained policies are called “agents”
Throughout my PhD in RL theory, I accepted the idea that RL tends to create agents, and supervised learning doesn’t. Well-cited papers use the term “agents”, as do textbooks and Wikipedia. I also hadn’t seen anyone give the pushback I give in this post.
Question: Given a fixed architecture (e.g. a 48-layer decoder-only transformer), what kinds of learning processes are more likely to train policy networks which use internal planning?
If you’re like I was in early 2022, you might answer “RL trains agents.” But why? In what ways do PPO’s weight updates tend to accumulate into agentic circuitry, while unsupervised pretraining on OpenWebText does not?
Claim: People are tempted to answer “RL” because the field adopted the “agent” terminology for reasons unrelated to the above question. Everyone keeps using the loaded terminology because no one questions it.
Let’s be clear. RL researchers did not deliberate carefully about the inductive biases of deep learning, and then decide that a certain family of algorithms was especially likely to train agentic cognition. Researchers called policies “agents” as early as 1995, before the era of deep learning (e.g. see Artificial Intelligence: A Modern Approach, 1st edition).
Does RL actually produce agents?
Just because “agents” was chosen for reasons unrelated to agentic cognition, doesn’t mean the name is inappropriate. I can think of a few pieces of evidence for RL entraining agentic cognition.
RL methods are often used to train networks on tasks like video games and robotics. These methods are used because they work, and these tasks seem to have an “autonomous” and “action-directed” nature. This is weak evidence of RL being appropriate for producing agentic cognition. Not strong evidence.
RL allows reinforcing behavior[1] which we couldn’t have demonstrated ourselves. For example, actuating a simulated robot to perform a backflip. If we could do this ourselves and had the time to spare, we could have just provided supervised feedback. But this seems just like a question of providing training signal in more situations. Not strong evidence.
Many practical RL algorithms are on-policy, in that the policy’s current behavior affects its future training data. This may lead to policies which “chain into themselves over time.” This seems related to “nonmyopic training objectives.” I have more thoughts here, but they’re still vague and heuristic. Not strong evidence.
There’s some empirical evidence from Discovering Language Model Behaviors with Model-Written Evaluations, which I’ve only skimmed. They claim to present evidence that RLHF increases e.g. power-seeking. I might end up finding this persuasive.
There’s good evidence that humans and other animals do something akin to RL. For example, something like TD learning may be present in the brain. Since some humans are agentic sometimes, and my guess is that RL is one of the main learning processes in the brain, this is some evidence for RL producing agentic cognition.
Overall, I do lean towards “RL is a way of tying together pretrained cognition into agentic goal pursuit.” I don’t think this conclusion is slam-dunk or automatic, and don’t currently think RL is much more dangerous than other ways of computing weight updates. I’m still trying to roll back the invalid updates I made due to the RL field’s inappropriate “agents” terminology. (My current guesses here should be taken strictly separately from the main point of the post.)
Conclusions
Use neutral, non-loaded terminology like “policy” instead of “agent”, unless you have specific reason to think the policy is agentic.
Yes, it’ll be hard to kick the habit. I’ve been working on it for about a month.
Don’t wait for everyone to coordinate on saying “policy.” You can switch to “policy” right now and thereby improve your private thoughts about alignment, whether or not anyone else gets on board. I’ve enjoyed these benefits for a month. The switch didn’t cause communication difficulties.
Strongly downweight the memes around RL “creating agents.”
“RLHF boosts agentic cognition” seems like a contingent empirical fact, and not trivially deducible from “PPO is an RL algorithm.” Even if RLHF in fact boosts agentic cognition, you’ve probably overupdated towards this conclusion due to loaded terminology.
However, only using unsupervised pretraining doesn’t mean you’re safe. E.g. base GPT-5 can totally seek power, whether or not some human researchers in the 1970s decided to call their trained artifacts “agents.”
Thanks to Aryan Bhatt for clarifying the distinction between policies and policy networks.
Appendix: Other bad RL terminology
“Reward” (bad) → “Reinforcement” (better)
“Reward” has absurd and inappropriate pleasurable connotations which suggest that the policy will seek out this “rewarding” quantity.
I prefer “reinforcement” because it’s more accurate (at least for the policy gradient algorithms I care about) and is overall a neutral word. The cost is that “reinforcement function” is somewhat nonstandard, requiring extra explanation. I think this is often worth it in personal and blog-post communication, and maybe also in conference papers.
“Optimal policy” → “Reinforcement-maximizing policy”
Saying “optimal” makes the policy sound good and smart, and suggests that the reinforcement function is something which should be optimized over. As I discussed in a recent comment, I think that’s muddying and misleading. In my internal language, “optimal policy” translates to “reinforcement-maximizing policy.” I will probably adopt this for some communication.
[1] Technically, we aren’t just reinforcing behavior. A policy gradient will upweight certain logits in certain situations. This parameter update generally affects the generalization properties of the network in all situations.
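The footnote's claim can be illustrated with a small sketch, under loud assumptions: the policy here is a hypothetical linear-softmax model with made-up feature vectors, and the update is a single plain REINFORCE-style step rather than PPO or any production algorithm. Two situations share one feature, so reinforcing an action in the first situation also shifts the action distribution in the second. In a real network, shared parameters play the role of the shared feature.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                         # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def action_probs(theta, features):
    """Linear-softmax policy: logits are theta @ features."""
    return softmax(theta @ features)

def policy_gradient_step(theta, features, action, reinforcement, lr=1.0):
    """One REINFORCE-style update: theta += lr * R * grad log pi(action | features)."""
    probs = action_probs(theta, features)
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    # For a softmax over linear logits, the gradient of log pi(action)
    # with respect to theta is outer(one_hot - probs, features).
    grad_log_prob = np.outer(one_hot - probs, features)
    return theta + lr * reinforcement * grad_log_prob

theta = np.zeros((3, 3))                    # 3 actions, 3 features (hypothetical)
situation_a = np.array([1.0, 1.0, 0.0])     # shares its second feature with b
situation_b = np.array([0.0, 1.0, 1.0])

before_a = action_probs(theta, situation_a)
before_b = action_probs(theta, situation_b)

# Reinforce action 0, but only in situation a.
theta_new = policy_gradient_step(theta, situation_a, action=0, reinforcement=1.0)

after_a = action_probs(theta_new, situation_a)
after_b = action_probs(theta_new, situation_b)

print("situation a:", before_a.round(3), "->", after_a.round(3))
print("situation b:", before_b.round(3), "->", after_b.round(3))
```

The update was computed from situation a alone, yet the probability of action 0 also rises in situation b, because both situations read from the same parameters.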
This post argues that, while it’s traditional to call policies trained by RL “agents”, there is no good reason for it and the terminology does more harm than good. IMO Turner has a valid point, but he takes it too far.
What is an “agent”? Unfortunately, this question is not discussed in the OP in any detail. There are two closely related informal approaches to defining “agents” that I like, one more axiomatic / black-boxy and the other more algorithmic / white-boxy.
The algorithmic definition is: An agent is a system that can (i) learn models of its environment (ii) use learned models to generate plans towards a particular goal (iii) execute these plans.
Under this definition, is an RL policy an “agent”? Not necessarily. There is a much stronger case for arguing that the RL algorithm, including the training procedure, is an agent. Indeed, such an algorithm (i) learns a model of the environment (at least if it’s model-based RL; if it’s model-free it might still do so implicitly, but it’s less clear), (ii) generates a plan (the policy), and (iii) executes the plan (when the policy is executed, i.e. at inference/deployment time). Whether the policy in itself is an agent amounts to asking whether the policy is capable of in-context RL (which is far from obvious). Moreover, the case for calling the system an agent is stronger when it learns online and weaker (but not completely gone) when there is a separation into non-overlapping training and deployment phases, as is often done in contemporary systems.
The axiomatic definition is: An agent is a system that effectively pursues a particular goal in an unknown environment. That is, it needs to perform well (as measured by achieving the goal) when placed in a large variety of different environments.
With this definition we reach similar conclusions. An online RL system would arguably adapt to its environment and optimize towards achieving the goal (which is maximizing the reward). A trained policy will not necessarily do so: if it was trained in a particular environment, it can become completely ineffective in other environments!
Importantly, even an online RL system can easily fail at agentic-ness, depending on how good its learning algorithm is at dealing with distributional shift, nonrealizability etc. Nevertheless, the relation between agency and RL is pretty direct, more so than the OP implies.