Think carefully before calling RL policies “agents”
I think agentic systems represent most of AI extinction risk. I want to think clearly about what training procedures produce agentic systems. Unfortunately, the field of reinforcement learning has a convention of calling its trained artifacts “agents.” This terminology is loaded and inappropriate for my purposes. I advocate instead calling the trained system a “policy.” This name is standard, accurate, and neutral.
Don’t assume the conclusion by calling a policy an “agent”
The real-world systems we want to think about and align are very large neural networks like GPT-4. These networks are trained and finetuned via different kinds of self-supervised and reinforcement learning.
When a policy network is updated using a learning process, its parameters are changed via weight updates. Eventually, the process ends (assuming no online learning for simplicity). We are then left with a policy network (e.g. GPT-4). To actually use the network, we need to use some sampling procedure on its logits (e.g. top-p with a given temperature). Once we fix the policy network and sampling procedure, we get a mapping from observations (e.g. sequences of embeddings, like those for [I, love, dogs]) to probability distributions over outputs (e.g. tokens). This mapping is the policy.
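To make the "fixed network plus fixed sampling procedure" idea concrete, here is a toy sketch of top-p (nucleus) sampling turning logits into a distribution over tokens. The function name and the tiny four-token vocabulary are made up for illustration; this is not any particular library's API.

```python
import math

def top_p_distribution(logits, p=0.9, temperature=1.0):
    """Toy sketch: map logits to a probability distribution over tokens
    via temperature scaling plus top-p (nucleus) truncation. Fixing the
    network and this procedure yields the observation -> distribution
    mapping the post calls a "policy"."""
    # Softmax with temperature.
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most-probable tokens with cumulative mass >= p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize over the kept tokens; everything else gets probability 0.
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]

# Toy 4-token vocabulary: only the top tokens keep nonzero mass.
dist = top_p_distribution([2.0, 1.0, 0.5, -1.0], p=0.8)
```

The point of the sketch is only that "the policy" is this whole mapping, not the network weights alone: change `p` or `temperature` and you get a different policy from the same network.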
I want to carefully consider whether a trained policy will exhibit agentic cognition of various forms, including planning, goal-directedness, and situational awareness. While considering this question, we should not start calling the trained policy an “agent”! That’s like a detective randomly calling one of the suspects “criminal.” I prefer just calling the trained artifact a “policy.” This neutrally describes the artifact’s function, without connoting agentic or dangerous cognition.
Of course, a policy could in fact be computed using internal planning (e.g. depth-3 heuristic search) to achieve an internally represented goal (e.g. number of diamonds predicted to be present). I think it’s appropriate to call that kind of computation “agentic.” But that designation is only appropriate after further information is discovered (e.g. how the policy in fact works).
There’s no deep reason why trained policies are called “agents”
Throughout my PhD in RL theory, I accepted the idea that RL tends to create agents, and supervised learning doesn’t. Well-cited papers use the term “agents”, as do textbooks and Wikipedia. I also hadn’t seen anyone give the pushback I give in this post.
Question: Given a fixed architecture (e.g. a 48-layer decoder-only transformer), what kinds of learning processes are more likely to train policy networks which use internal planning?
If you’re like I was in early 2022, you might answer “RL trains agents.” But why? In what ways do PPO’s weight updates tend to accumulate into agentic circuitry, while unsupervised pretraining on OpenWebText does not?
Claim: People are tempted to answer “RL” because the field adopted the “agent” terminology for reasons unrelated to the above question. Everyone keeps using the loaded terminology because no one questions it.
Let’s be clear. RL researchers did not deliberate carefully about the inductive biases of deep learning, and then decide that a certain family of algorithms was especially likely to train agentic cognition. Researchers called policies “agents” as early as 1995, before the era of deep learning (e.g. see AI: A modern approach, 1st edition).
Does RL actually produce agents?
Just because “agents” was chosen for reasons unrelated to agentic cognition, doesn’t mean the name is inappropriate. I can think of a few pieces of evidence for RL entraining agentic cognition.
RL methods are often used to train networks on tasks like video games and robotics. These methods are used because they work, and these tasks seem to have an “autonomous” and “action-directed” nature. This is weak evidence of RL being appropriate for producing agentic cognition. Not strong evidence.
RL allows reinforcing behavior[1] which we couldn’t have demonstrated ourselves. For example, actuating a simulated robot to perform a backflip. If we could do this ourselves and had the time to spare, we could have just provided supervised feedback. But this seems just like a question of providing training signal in more situations. Not strong evidence.
Many practical RL algorithms are on-policy, in that the policy’s current behavior affects its future training data. This may lead to policies which “chain into themselves over time.” This seems related to “nonmyopic training objectives.” I have more thoughts here, but they’re still vague and heuristic. Not strong evidence.
There’s some empirical evidence from Discovering Language Model Behaviors with Model-Written Evaluations, which I’ve only skimmed. They claim to present evidence that RLHF increases e.g. power-seeking. I might end up finding this persuasive.
There’s good evidence that humans and other animals do something akin to RL. For example, something like TD learning may be present in the brain. Since some humans are agentic sometimes, and my guess is that RL is one of the main learning processes in the brain, this is some evidence for RL producing agentic cognition.
Overall, I do lean towards “RL is a way of tying together pretrained cognition into agentic goal pursuit.” I don’t think this conclusion is slam-dunk or automatic, and don’t currently think RL is much more dangerous than other ways of computing weight updates. I’m still trying to roll back the invalid updates I made due to the RL field’s inappropriate “agents” terminology. (My current guesses here should be taken strictly separately from the main point of the post.)
Conclusions
Use neutral, non-loaded terminology like “policy” instead of “agent”, unless you have specific reason to think the policy is agentic.
Yes, it’ll be hard to kick the habit. I’ve been working on it for about a month.
Don’t wait for everyone to coordinate on saying “policy.” You can switch to “policy” right now and thereby improve your private thoughts about alignment, whether or not anyone else gets on board. I’ve enjoyed these benefits for a month. The switch didn’t cause communication difficulties.
Strongly downweight the memes around RL “creating agents.”
“RLHF boosts agentic cognition” seems like a contingent empirical fact, and not trivially deducible from “PPO is an RL algorithm.” Even if RLHF in fact boosts agentic cognition, you’ve probably overupdated towards this conclusion due to loaded terminology.
However, only using unsupervised pretraining doesn’t mean you’re safe. E.g. base GPT-5 can totally seek power, whether or not some human researchers in the 1970s decided to call their trained artifacts “agents.”
Thanks to Aryan Bhatt for clarifying the distinction between policies and policy networks.
Appendix: Other bad RL terminology
“Reward” (bad) → “Reinforcement” (better)
“Reward” has absurd and inappropriate pleasurable connotations which suggest that the policy will seek out this “rewarding” quantity.
I prefer “reinforcement” because it’s more accurate (at least for the policy gradient algorithms I care about) and is overall a neutral word. The cost is that “reinforcement function” is somewhat nonstandard, requiring extra explanation. I think this is often worth it in personal and blog-post communication, and maybe also in conference papers.
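One way to see why "reinforcement" is the accurate word for policy gradient methods: the scalar signal literally scales the gradient that upweights the sampled action's logits. A minimal REINFORCE sketch on a two-armed bandit (all names and numbers here are illustrative, not any particular library's API):

```python
import math
import random

def softmax(theta):
    exps = [math.exp(t) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(theta, reinforcement_fn, lr=0.1):
    """One REINFORCE update on a bandit (toy sketch).

    The scalar from `reinforcement_fn` multiplies the gradient of
    log pi(action): it reinforces the sampled action's logits, rather
    than being a quantity the policy "seeks out."
    """
    probs = softmax(theta)
    action = random.choices(range(len(theta)), weights=probs)[0]
    r = reinforcement_fn(action)
    # d/d theta_i of log softmax(theta)[action] = [i == action] - probs[i]
    return [t + lr * r * ((1.0 if i == action else 0.0) - probs[i])
            for i, t in enumerate(theta)]

random.seed(0)
theta = [0.0, 0.0]
for _ in range(500):
    # Arm 0 yields reinforcement 1.0; arm 1 yields nothing.
    theta = reinforce_step(theta, lambda a: 1.0 if a == 0 else 0.0)
```

After training, the policy places most of its probability on arm 0, because that behavior was reinforced; nothing in the update rule requires the policy to represent or desire the scalar.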
“Optimal policy” → “Reinforcement-maximizing policy”
Saying “optimal” makes the policy sound good and smart, and suggests that the reinforcement function is something which should be optimized over. As I discussed in a recent comment, I think that’s muddying and misleading. In my internal language, “optimal policy” translates to “reinforcement-maximizing policy.” I will probably adopt this for some communication.
[1] Technically, we aren’t just reinforcing behavior. A policy gradient will upweight certain logits in certain situations. This parameter update generally affects the generalization properties of the network in all situations.
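The footnote's point, that one update's effects spill into situations never visited, can be sketched with a toy linear-softmax policy whose two actions share weights (everything here is illustrative):

```python
import math

def probs(w, x):
    """Linear-softmax 'policy' over two actions with shared weight rows."""
    logits = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
    exps = [math.exp(l) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

w = [[0.0, 0.0], [0.0, 0.0]]
x_trained = [1.0, 0.5]   # the situation we update on
x_other = [1.0, -0.5]    # a situation never seen during the update

before = probs(w, x_other)[0]

# One policy-gradient-style step upweighting action 0 in x_trained only.
p = probs(w, x_trained)
lr = 0.5
for a in range(2):
    grad = (1.0 if a == 0 else 0.0) - p[a]
    w[a] = [wi + lr * grad * xi for wi, xi in zip(w[a], x_trained)]

after = probs(w, x_other)[0]
# `after` differs from `before`: the single update also changed the
# policy's behavior in the unvisited situation, via the shared parameters.
```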
I’m generally on board with attempts to have more precise options for referring to these concepts, and in this context I agree that policy as a term is more appropriate and that gradients from RL training don’t magically include more agent juice.
That said, I do think there is an important distinction between the tendencies of systems built with RL versus supervised learning that arises from reward sparsity.
In traditional RL, individual policy outputs aren’t judged in as much detail as in supervised learning. Even when comparing against RL with reward shaping, it is still likely going to be far less densely defined and constrained than, say, per-output predictive loss.
Since the target is smaller and more distant, traditional RL gives the optimizer more room to roam. I think it’s correct to say that most RL implementations will have a lot of reactive bits and pieces that are selected to form the final policy, but because learning instrumental behavior is effectively required for traditional RL to get anywhere at all, it’s more likely (than in predictive loss) that nonmyopic internal goal-like representations will be learned as a part of those instrumental behaviors.
Training on purely predictive loss, in contrast, is both densely informative and extremely constraining. Goals are less obviously convergently useful, and any internal goal representations that are learned need to fit within the bounds enforced by the predictive loss and should tend to be more local in nature as a result. Learned values that overstep their narrowly-defined usefulness get directly slapped by other predictive samples.
I think the greater freedom RL training tends to have, and the greater tendency to learn more broadly applicable internal goals to drive the required instrumental behavior, do make RL-trained systems feel more “agentic” even if it is not absolutely fundamental to the training process, nor even really related to the model’s coherence.
How do you think “agent” should be defined?
According to the LessWrong concepts page for agents, an agent is an entity that perceives its environment and takes actions to maximize its utility.
I think I see where you’re coming from but I generally have mixed feelings, and am going back and forth but leaning towards sticking with textbook terminology for my part.
I’m a bit confused about what you’re proposing. AlphaZero has an input (board state) and an output (move). Are you proposing to call this input-output function “a policy”?
If so, sure we can say that, but I think people would find it confusing—because there’s a tree search in between the input and output, and one ingredient of the tree search is the “policy network” (or maybe just “policy head”, I forget), but here the relation between the “policy network” and the final input-output function is very indirect, such that it seems odd to use (almost) the same term for them.
The word “agent” invokes a bundle of intuitions / associations, and you think many of those are misleading in general. So then one approach is to ask everyone to avoid the word “agent” in cases where those intuitions don’t apply, and the other is to ask everyone to constantly remind each other that the “agents” produced by RL don’t necessarily have thus-and-such properties.
Neither option is great; this is obviously a judgment call.
For my part, I think that if I say:
“An RL agent isn’t necessarily planning ahead towards goals, in many cases it’s better to think of it as a bundle of situation-dependent reactions…”
…then that strikes me as a normal kind of thing to say as part of a healthy & productive conversation.
So maybe I see pushing-back-on-the-intuitions-while-keeping-the-word as a more viable approach than you do.
(And separately, I see editing widely-used terminology as a very very big cost, probably moreso than you.)
Ditto for “reward”.
this kinda sounds slightly weird in my mind because I seem to be intuitively associating “reinforcement” with “updates” and the policy in question is a fixed-point that stops getting updated altogether.
You mention that this is off-topic so maybe you don’t want to discuss it, but I probably disagree with that—with the caveat that it’s very difficult to do an other-things-equal comparison. (I.e., we’re presumably interested in RL-safety-versus-SSL-safety holding capabilities fixed, but switching from RL to SSL does have an effect on capabilities.)
Then later you say “only using unsupervised pretraining doesn’t mean you’re safe” which is a much weaker statement, and I agree with it.
In my head, a policy is just a situation-dependent way of acting. Sometimes that way of acting makes use of foresight, sometimes that way of acting is purely reflexive. I mentally file the AlphaZero policy network + tree search combination as a “policy”, one separate from the “reactive policy” defined by just using the policy network without tree search. Looking back at Sutton & Barto, they define “policy” similarly:

[quoted definition of “policy” from Sutton & Barto, emphasis mine]

along with this later description of planning in a model-based RL context:

[quoted passage from Sutton & Barto on planning]

which seems compatible with thinking of planning algorithms like MCTS as components of an improved policy at runtime (not just in training).
That being said, looking at the AlphaZero paper, a quick search did not turn up usages of the term “policy” in this way. So maybe this usage is less widespread than I had assumed.
Interesting, thanks!
I think there’s a way better third alternative: asking each reader to unilaterally switch to “policy.” No coordination, no constant reminders, no communication difficulties (in my experience). I therefore don’t see a case for using “agent” in the mentioned cases.
I added to the post:
How about going the other direction and calling systems that have agency “wanters”? It’s what I’ve been using, and it seems intuitive enough that I expect anyone who hears it, technical or not, to understand what I mean in a general sense, so that when I explain agency in detail the concept is already correctly bound to the word.
I like that. Agency on LW usually means « the kind of behavior where you might hide your intentions to better fight anything that could stand in your way ». Calling that « wanters » or « strategical wanters » would help avoid confusion with the technical and philosophical meanings.
Well, but actually I think the other philosophical meanings are mostly the same thing, so maybe the split is too intense. I do just mean “agency” in the philosophical sense I know it.
I beg to differ, but that’s exactly why I liked your suggestion.
Once you have a policy network and a sampling procedure, you can embody it in a system which samples the network repeatedly, and hooks up the I/O to the proper environment and actuators. Usually this involves hooking the policy into a simulation of a game environment (e.g. in a Gym), but sometimes the embodiment is an actual robot in the real world.
I think using the term “agent” for the policy itself is actually a type error, and not just misleading. I think using the term to refer to the embodied system has the correct type signature, but I agree it can be misleading, for the reasons you describe.
OTOH, I do think modelling the outward behavior of such systems by regarding them as agents with black-box internals is often useful as a predictor, and I would guess that this modelling is the origin of the use of the term in RL.
But modelling outward behavior is very different from attributing that behavior to agentic cognition within the policy itself. I think it is unlikely that any current policy networks are doing (much) agentic cognition at runtime, but I wouldn’t necessarily count on that trend continuing. So moving away from the term “agent” proactively seems like a good idea.
Anyway, I appreciate posts like this which clarify / improve standard terminology. Curious if you agree with my distinction about embodiment, and if so, if you have any better suggested term for the embodied system than “agent” or “embodiment”.
Claim: The embodied system is still not necessarily an agent, and may in failure cases lack the agency one expects it to have. Any representation of what agency is needs to separate successful agency from a system that is merely claimed to have it.
Core reason: Agency is a property of pulling the future back in time; it’s when a system selects actions by conditioning on the future. Agency is when any object, even ones not structured like traditional agents, takes the shape of the future before the future does and thereby steers the future.
How I came to believe this confidently: this paper, which you have probably seen but I link as pdf for reasons; anyone reading this who hasn’t seen it, I’d very strongly encourage at least skimming it. If by chance you haven’t already read it in detail, my recommended reading order if you have 20 minutes and already understand SCMs would be {1. intro} → {appendix B.} → {1.1 example, 1.2 other characterizations, 1.3 what do we consider} → skim/quick-index/first-pass {2. background, 3. algorithms, 3.1 MSCM, 3.2 labeled MCG} → read and ponder 3.3 & 3.4 and algorithms 1 and 2, then skim through assumptions in 3.5 and read algorithm 3. If you really want to get into it you can then do several more passes to properly understand the algorithms.
This took me several days with multiple calls with friends, as I was new to SCMs. I’m abbreviating things so there isn’t an easy gloss of what I’m referring to without reading the paper; I can’t summarize precisely so I’m choosing to not summarize at all. Hopefully this isn’t new to @Max H, but on the off chance it is, this is my reply to describe why I disagree.
Hadn’t seen the paper, but I think I basically agree with it, and your claim.
I was mainly saying something even weaker: the policy itself is just a function, so it can’t be an agent. The thing that might or might not be an agent is an embodiment of the policy by repeatedly executing it in the appropriate environment, while hooked up to (real or simulated) I/O channels.
Interesting distinction. An agent that is asleep isn’t an agent, by this usage.
By the way, are you Max H of the space rock ai thingy?
Also, I didn’t mean for this distinction to be particularly interesting—I am still slightly concerned that it is so pedantic / boring / obvious that I’m the only one who finds it worth distinguishing at all.
I’m literally just saying, a description of a function / mind / algorithm is a different kind of thing than the (possibly repeated) execution of that function / mind / algorithm on some substrate. If that sounds like a really deep or interesting point, I’m probably still being misunderstood.
Well, a sleeping person is still an embodied system, with running processes and sensors that can wake the agent up. And the agent, before falling asleep, might arrange things such that they are deliberately woken up in the future under certain circumstances (e.g. setting an alarm, arranging a guard to watch over them during their sleep).
The thing I’m saying that is not an agent is more like, a static description of a mind. e.g. the source code of an AGI isn’t an agent until it is compiled and executed on some kind of substrate. I’m not a carbon (or silicon) chauvinist; I’m not picky about which substrate. But without some kind of embodiment and execution, you just have a mathematical description of a computation, the actual execution of which may or may not be computable or otherwise physically realizable within our universe.
Nope, different person!
okay, perhaps sleep doesn’t cut it. I was calling the unrun policy a sleeping ai, but perhaps suspended or stopped might be better words to generalize the unrun state of a system that would be agentic when you type
python inference.py
and hit enter on your command line.

I think the embodiment distinction is interesting and hadn’t thought of it before (note that I didn’t understand your point until reading the replies to your comment). I’m not yet sure if I find this distinction worth making, though. I’d refer to the embodied system as a “trained system” or—after reading your suggestion—an “embodiment.” Neither feels quite right to me, though.
I was just trying to replace “reward” by “reinforcement”, but hit the problem that “negative reward” makes sense, but behaviorist terminology is such that “reinforcement” is always after a good thing happens, including “negative reinforcement”, which would be a kind of positive reward that entails removing something aversive. The behaviorists use the word “punishment” for “negative reward”. But “punishment” has all the same downsides as “reward”, so I assume you’re also opposed to that. Unfortunately, if I avoid both “punishment” and “reward”, then it seems I have no way to unambiguously express the concept “negative reward”.
So “negative reward” it is. ¯\_(ツ)_/¯
Yeah, seems tough to avoid “reward” in that situation. Thanks for pointing this out.
Counterpoint: this is needlessly pedantic and a losing fight.
My understanding of the core argument is that “agent” in alignment/safety literature has a slightly different meaning than “agent” in RL. It might be the case that the difference turns out to be important, but there’s still some connection between the two meanings.
I’m not going to argue that RL inherently creates “agentic” systems in the alignment sense. I suspect there’s at least a strong correlation there (i.e. an RL-trained agent will typically create an agentic system), but that’s honestly beside the point.
The term “RL agent” is very well entrenched and de facto a correct technical term for that part of the RL formalism. Just because alignment people use that term differently doesn’t justify going into neighboring fields and demanding that they change their ways.
It’s kinda like telling biologists that they shouldn’t use the word [matrix](https://en.wikipedia.org/wiki/Matrix_(biology)) because actual matrices are arrays of numbers (or linear maps whatever, mathematicians don’t @ me)
And finally, as an example why even if I drank the kool-aid, I absolutely couldn’t do the switch you’re recommending—what about multiagent RL? Especially one with homogeneous agents. Doing s/agent/policy/g won’t work, because a multiagent algorithm doesn’t have to be multipolicy.
The appendix on s/reward/reinforcement/g is even more silly in my opinion. RL agents (heh) are designed to seek out the reward. They might fail, but that’s the overarching goal.
I’m… not demanding that the field of RL change? Where in the post did you perceive me to demand this? For example, I wrote that “I wouldn’t say ‘reinforcement function’ in e.g. a conference paper.” I also took care to write “This terminology is loaded and inappropriate for my purposes.”
Each individual reader can choose to swap to “policy” without communication difficulties, in my experience:
(As an aside, I also separately wish RL would change its terminology, but it’s a losing fight as you point out, and I have better things to do with my time.)
I came across this in Ng and Russell (2000) yesterday, and searching for it I see it’s reasonably common. You could probably get away with it.
I’m very glad to have read this post and “Reward is not the optimization target”. I hope you continue to write “How not to think about [thing] posts”, as they have me nailed. Strong upvote.
Strong agree with the need for nuance. ‘Model’ is another word that gets horribly mangled a lot recently.
I think the more sensible uses of the word ‘agent’ I’ve come across are usually referring to the assemblage of a policy-under-training plus the rest of the shebang: learning method, exploration tricks of one kind or another, environment modelling (if any), planning algorithm (if any) etc. This seems more legit to me, though I still avoid using the word ‘agent’ as far as possible for similar reasons (discussed here (footnote 6) and here).
Similarly to Daniel’s response to ‘reward is not the optimization target’, I think you can be more generous in your interpretation of RL experts’ words and read less error in. That’s not to say more care in communication and terminology wouldn’t be preferable, a takeaway I strongly endorse.
What other, more favorable interpretations might I consider?
Oh, I mean to refer to the rest of the comment
and taking that sort of reading as a kind of innocent until proven guilty.
I’ll confess I was in a meeting yesterday and someone (a PhD student) made the obvious error of considering RL prerequisite to agentiness, perhaps (but not definitely) a consequence of exactly the conflation you’re referring to in this post. Several people in the room were able to clarify. The context was a crossover between a DL lab (this mentioned PhD student’s) and the safety research community in Oxford (me et al).
Given that Direct Preference Optimization (DPO) seems to work pretty well and has the same global optimizer as the RLHF objective, I would be surprised if it doesn’t shape agency in a similar way to RLHF. Since DPO is not considered reinforcement learning, this would be more evidence that RL isn’t uniquely suited to produce agents or increase power-seeking.
Depending on the sampling process you use, I think you should consider this the same as RL.
I’m not sure what you mean, in DPO you never sample from the language model. You only need the probabilities of the model producing the preference data, there isn’t any exploration.
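For readers following the thread, the per-pair DPO objective being discussed can be sketched as follows. This is my paraphrase of the published loss, with variable names of my own choosing:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss (paraphrase of the published objective).

    Arguments are log-probabilities of the chosen (w) and rejected (l)
    completions under the trained policy and a frozen reference model.
    Note that computing it requires no sampling from the policy, only
    log-probs on a fixed preference dataset, which is the point made
    in the comment above.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

When the policy matches the reference model, the margin is zero and the loss is log 2; raising the chosen completion's log-probability relative to the rejected one lowers the loss.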
Doing multiple rounds of DPO where you sample from the LLM to get comparison pairs seems totally possible and might be the best way to use DPO in many cases.
You can of course use DPO on data obtained from sources other than the LLM itself.
Interesting. I’m thinking that with “many cases” you mean cases where either manually annotating the data over multiple rounds is possible (cheap), or cases where the model is powerful enough to label the comparison pairs, and we get something like the DPO version of RLAIF. That does sound more like RL.
I intended this.
This is the same as normal RLHF. In practice the sample efficiency of DPO might be higher or lower than (e.g.) PPO based RLHF in various different cases.
Humans are normally agentic (sadly they can also quite often be selfish, power-seeking, deceitful, bad-tempered, untrustworthy, and/or generally unaligned). Standard unsupervised LLM foundation model training teaches LLMs how to emulate humans as text-generation processes. This will inevitably include modelling many aspects of human psychology, including the agentic ones, and the unsavory ones. So LLMs have trained-in agentic behavior before any RL is applied, or even if you use entirely non-RL means to attempt to make them helpful/honest/harmless (e.g. how Google did this to LaMDA). They have been trained on a great many examples of deceit, power-seeking, and every other kind of nasty human behavior, so RL is not the primary source of the problem.
The alignment problem is about producing something that we are significantly more certain is aligned than a typical randomly-selected human. Handing a randomly-selected human absolute power over all of society is unlikely to end well. What we need to train is a selfless altruist who (platonically or parentally) loves all humanity. For lack of better terminology: we need to create a saint or an angel.