Think carefully before calling RL policies “agents”
I think agentic systems represent most of AI extinction risk. I want to think clearly about what training procedures produce agentic systems. Unfortunately, the field of reinforcement learning has a convention of calling its trained artifacts “agents.” This terminology is loaded and inappropriate for my purposes. I advocate instead calling the trained system a “policy.” This name is standard, accurate, and neutral.
Don’t assume the conclusion by calling a policy an “agent”
The real-world systems we want to think about and align are very large neural networks like GPT-4. These networks are trained and finetuned via different kinds of self-supervised and reinforcement learning.
When a policy network is updated using a learning process, its parameters are changed via weight updates. Eventually, the process ends (assuming no online learning for simplicity). We are then left with a policy network (e.g. GPT-4). To actually use the network, we need to use some sampling procedure on its logits (e.g. top-p with a given temperature). Once we fix the policy network and sampling procedure, we get a mapping from observations (e.g. sequences of embeddings, like those for [I, love, dogs]) to probability distributions over outputs (e.g. tokens). This mapping π is the policy.
I want to carefully consider whether a trained policy will exhibit agentic cognition of various forms, including planning, goal-directedness, and situational awareness. While considering this question, we should not start calling the trained policy an “agent”! That’s like a detective randomly calling one of the suspects “criminal.” I prefer just calling the trained artifact a “policy.” This neutrally describes the artifact’s function, without connoting agentic or dangerous cognition.
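To make the “fixed network plus sampling procedure” description concrete, here’s a minimal sketch of the mapping π (a toy 5-token vocabulary and hand-written logits standing in for a real network’s output, not GPT-4’s actual decoding stack):

```python
import numpy as np

def policy(logits: np.ndarray, p: float = 0.9, temperature: float = 1.0) -> np.ndarray:
    """Map the network's logits for one observation to a distribution over tokens.
    The fixed network plus this sampling procedure is the policy π."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Top-p: keep the smallest set of tokens whose cumulative probability reaches p.
    order = np.argsort(probs)[::-1]
    keep = order[: np.searchsorted(np.cumsum(probs[order]), p) + 1]

    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

# Hypothetical logits the network might produce after observing [I, love, dogs].
print(policy(np.array([2.0, 1.5, 0.3, -1.0, -2.0])))
```

Nothing about this type signature (observations in, token distributions out) says anything about agency one way or the other.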
Of course, a policy could in fact be computed using internal planning (e.g. depth-3 heuristic search) to achieve an internally represented goal (e.g. number of diamonds predicted to be present). I think it’s appropriate to call that kind of computation “agentic.” But that designation is only appropriate after further information is discovered (e.g. how the policy in fact works).
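To be concrete about what that would mean, here’s the kind of computation I’d happily call “agentic” if we found it implemented in the weights (the world model and diamond predictor below are hypothetical placeholders, not claims about any actual network):

```python
def plan(state, actions, world_model, predicted_diamonds, depth=3):
    """Depth-limited heuristic search: choose the action whose predicted
    depth-3 consequences score highest under the internally represented goal.

    world_model:        hypothetical internal (state, action) -> next-state predictor
    predicted_diamonds: hypothetical internal goal representation, state -> float
    """
    def value(s, d):
        if d == 0:
            return predicted_diamonds(s)
        return max(value(world_model(s, a), d - 1) for a in actions)

    return max(actions, key=lambda a: value(world_model(state, a), depth - 1))
```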
There’s no deep reason why trained policies are called “agents”
Throughout my PhD in RL theory, I accepted the idea that RL tends to create agents, and supervised learning doesn’t. Well-cited papers use the term “agents”, as do textbooks and Wikipedia. I also hadn’t seen anyone give the pushback I give in this post.
Question: Given a fixed architecture (e.g. a 48-layer decoder-only transformer), what kinds of learning processes are more likely to train policy networks which use internal planning?
If you’re like I was in early 2022, you might answer “RL trains agents.” But why? In what ways do PPO’s weight updates tend to accumulate into agentic circuitry, while unsupervised pretraining on OpenWebText does not?
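Concretely, both procedures boil down to gradient steps on the same parameters. Here’s a schematic comparison (PyTorch-flavored, with a hypothetical model interface and tensor shapes; the RL step is a simplified REINFORCE-style update rather than full PPO):

```python
import torch
import torch.nn.functional as F

def pretraining_step(model, tokens, optimizer):
    # Self-supervised pretraining: raise the log-probability of each next token.
    logits = model(tokens[:, :-1])                      # [batch, time, vocab]
    loss = F.cross_entropy(logits.flatten(0, 1), tokens[:, 1:].flatten())
    optimizer.zero_grad(); loss.backward(); optimizer.step()

def reinforce_step(model, tokens, reinforcement, optimizer):
    # Policy gradient: raise the log-probability of the sampled tokens,
    # weighted by the reinforcement signal for the whole sequence.
    logits = model(tokens[:, :-1])                      # [batch, time, vocab]
    logprobs = torch.distributions.Categorical(logits=logits).log_prob(tokens[:, 1:])
    loss = -(reinforcement * logprobs.sum(dim=-1)).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```

Whether one of these accumulates agentic circuitry and the other doesn’t is exactly the open question; it isn’t settled by the names we give them.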
Claim: People are tempted to answer “RL” because the field adopted the “agent” terminology for reasons unrelated to the above question. Everyone keeps using the loaded terminology because no one questions it.
Let’s be clear. RL researchers did not deliberate carefully about the inductive biases of deep learning, and then decide that a certain family of algorithms was especially likely to train agentic cognition. Researchers called policies “agents” as early as 1995, before the era of deep learning (e.g. see AI: A Modern Approach, 1st edition).
Does RL actually produce agents?
Just because “agents” was chosen for reasons unrelated to agentic cognition doesn’t mean the name is inappropriate. I can think of a few pieces of evidence for RL entraining agentic cognition.
RL methods are often used to train networks on tasks like video games and robotics. These methods are used because they work, and these tasks seem to have an “autonomous” and “action-directed” nature. This is weak evidence of RL being appropriate for producing agentic cognition. Not strong evidence.
RL allows reinforcing behavior[1] which we couldn’t have demonstrated ourselves. For example, actuating a simulated robot to perform a backflip. If we could do this ourselves and had the time to spare, we could have just provided supervised feedback. But this just seems like a question of providing training signal in more situations. Not strong evidence.
Many practical RL algorithms are on-policy, in that the policy’s current behavior affects its future training data. This may lead to policies which “chain into themselves over time.” This seems related to “nonmyopic training objectives.” I have more thoughts here, but they’re still vague and heuristic. Not strong evidence.
There’s some empirical evidence from Discovering Language Model Behaviors with Model-Written Evaluations, which I’ve only skimmed. They claim to present evidence that RLHF increases e.g. power-seeking. I might end up finding this persuasive.
There’s good evidence that humans and other animals do something akin to RL. For example, something like TD learning may be present in the brain. Since some humans are agentic sometimes, and my guess is that RL is one of the main learning processes in the brain, this is some evidence for RL producing agentic cognition.
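(For reference, the TD-learning idea being gestured at is just the textbook update below; this is an illustration of the algorithmic idea, not a model of neural mechanisms.)

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    # Nudge the value estimate of state s toward the bootstrapped target r + γ·V[s_next].
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
```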
Overall, I do lean towards “RL is a way of tying together pretrained cognition into agentic goal pursuit.” I don’t think this conclusion is slam-dunk or automatic, and don’t currently think RL is much more dangerous than other ways of computing weight updates. I’m still trying to roll back the invalid updates I made due to the RL field’s inappropriate “agents” terminology. (My current guesses here should be taken strictly separately from the main point of the post.)
Conclusions
Use neutral, non-loaded terminology like “policy” instead of “agent”, unless you have specific reason to think the policy is agentic.
Yes, it’ll be hard to kick the habit. I’ve been working on it for about a month.
Don’t wait for everyone to coordinate on saying “policy.” You can switch to “policy” right now and thereby improve your private thoughts about alignment, whether or not anyone else gets on board. I’ve enjoyed these benefits for a month. The switch didn’t cause communication difficulties.
Strongly downweight the memes around RL “creating agents.”
“RLHF boosts agentic cognition” seems like a contingent empirical fact, and not trivially deducible from “PPO is an RL algorithm.” Even if RLHF in fact boosts agentic cognition, you’ve probably overupdated towards this conclusion due to loaded terminology.
However, only using unsupervised pretraining doesn’t mean you’re safe. E.g. base GPT-5 can totally seek power, whether or not some human researchers in the 1970s decided to call their trained artifacts “agents.”
Thanks to Aryan Bhatt for clarifying the distinction between policies and policy networks.
Appendix: Other bad RL terminology
“Reward” (bad) → “Reinforcement” (better)
“Reward” has absurd and inappropriate pleasurable connotations which suggest that the policy will seek out this “rewarding” quantity.
I prefer “reinforcement” because it’s more accurate (at least for the policy gradient algorithms I care about) and is overall a neutral word. The cost is that “reinforcement function” is somewhat nonstandard, requiring extra explanation. I think this is often worth it in personal and blog-post communication, and maybe also in conference papers.
“Optimal policy” → “Reinforcement-maximizing policy”
Saying “optimal” makes the policy sound good and smart, and suggests that the reinforcement function is something which should be optimized over. As I discussed in a recent comment, I think that’s muddying and misleading. In my internal language, “optimal policy” translates to “reinforcement-maximizing policy.” I will probably adopt this for some communication.
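For concreteness, the object being renamed is just the standard discounted-return maximizer (usual notation; $R$ is the reinforcement function and $\gamma$ the discount factor):

$$\pi^* \in \arg\max_{\pi}\; \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, R(s_t, a_t)\right]$$

Nothing in this definition says that maximizing $R$ is good or smart; it just picks out the reinforcement-maximizing policy.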
[1] Technically, we aren’t just reinforcing behavior. A policy gradient will upweight certain logits in certain situations. This parameter update generally affects the generalization properties of the network in all situations.
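To illustrate that footnote with a toy example (a tiny linear “network” and two made-up observations, not any real training setup): a single policy-gradient step that upweights one action’s logit in one situation also shifts the outputs in a situation that was never reinforced.

```python
import torch

torch.manual_seed(0)
net = torch.nn.Linear(4, 3)                 # 4-dim observation -> 3 action logits
opt = torch.optim.SGD(net.parameters(), lr=0.5)

obs_reinforced = torch.randn(4)             # the situation we reinforce in
obs_other = torch.randn(4)                  # a different, never-reinforced situation
before = torch.softmax(net(obs_other), dim=-1).detach()

# One REINFORCE-style step: upweight the logit of action 0 in obs_reinforced.
logprob = torch.log_softmax(net(obs_reinforced), dim=-1)[0]
(-logprob).backward()                       # reinforcement of +1 for that action
opt.step()

after = torch.softmax(net(obs_other), dim=-1).detach()
print(before, after)                        # the distribution in obs_other moves too
```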