The argument above isn’t clear to me, because I’m not sure how you’re defining your terms.
I should note that, contrary to the statement “reward is _not_, in general, that-which-is-optimized by RL agents”, by definition “reward _must be_ what is optimized for by RL agents.” If they do not do that, they are not RL agents. At least, that is true based on the way the term “reward” is commonly used in the field of RL. That is what RL agents are programmed by humans to do: they change their behavior over many trials, and test how those behavioral changes affect the reward signals they receive.
The only case where that is not true is one where you define “reward” in some other way than Sutton, whom you quote. I would be curious to hear how you define reward. If you redefine it, that should be done explicitly, and contrasted to the pre-existing definition so that people can accurately interpret what you’ve written.
I don’t pretend to be the authority on RL, but I have a decent understanding of the basic RL loop by which an agent sends actions into an environment, and receives rewards and state updates from that environment. Here’s my understanding of commonly used deep RL algorithms, which I’ll refer to as standard RL:
First, it’s useful to make a few key distinctions, namely between:
- the reward signal (the quantum of reward that is actually allotted to the agent step by step);
- the reward function (also known as the objective function, which is the formula by which we decide how much reward to allot in response to an RL agent’s actions and accomplishments, and when);
- the environment in which an RL agent operates (the complex system that is altered by the agent’s actions, which includes the states through which the agent moves, the rules of state transitions, and the actions available in any state);
- the human programmer’s goals (ie the thing I, the programmer, want an agent to achieve, which may be imperfectly expressed in the reward function I write for it); the divergence between human wishes and the explicit incentives given to agents seems missing in this discussion.
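These distinctions can be made concrete with a minimal sketch of the basic RL loop. The toy corridor environment and the names `step` and `reward_fn` below are purely illustrative assumptions of mine, not any particular library's API:

```python
import random

# A toy "environment": the agent walks a 1-D corridor toward a goal cell.
# The environment defines states, the transition rule, and legal actions.
GOAL = 5

def step(state, action):
    """State transition rule: action is -1 (left) or +1 (right)."""
    return max(0, min(GOAL, state + action))

# The "reward function" is the formula we, the programmers, write.
def reward_fn(state):
    return 10 if state == GOAL else 0

# The basic RL loop: the agent acts, the environment returns a new
# state, and the reward function emits a per-step "reward signal".
state = 0
total_reward = 0
for t in range(20):
    action = random.choice([-1, +1])   # a (deliberately dumb) policy
    state = step(state, action)
    signal = reward_fn(state)          # the quantum of reward at this step
    total_reward += signal
```

Note how the four things above are separate objects: the environment (`step` plus the state space), the reward function (`reward_fn`), the reward signal (`signal`), and my goal as programmer (get the agent to cell 5), which the reward function may or may not capture faithfully.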
An RL agent chooses from a set of actions it can take in any possible state. For any RL agent, taking actions over a series of states leads to some reward or succession of rewards, even if they amount to 0 (ie, the agent failed). The rewards can be disbursed according to the end state that the agent reaches during its run, or as the agent progresses through the run.
Example: I can define an objective function that rewards the agent 10 points when it reaches a goal that I have decided is desirable (eg arrival at gramma’s house), or I can award the agent a small reward at each step that brings it progressively closer to gramma’s house. (The latter objective function can be much more effective, because it sends reward signals to the agent more frequently, allowing it to learn more quickly. These rewards are known as dense. Holding all reward until an agent reaches a distant end state is often called a sparse reward function. Sparse rewards make it harder for an agent to learn.) Choosing between reward functions (ie rewriting the objective function) is known as reward shaping. Learning the right way to shape rewards is an iterative process for the people creating RL agents: they write a reward function, then check whether it leads the RL agent to exhibit the right behavior. (Feel free to get meta about that…)
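The sparse/dense distinction can be sketched in a few lines. The helper names and the distance-based shaping here are my own illustrative assumptions:

```python
# Sparse: all reward withheld until the agent reaches gramma's house.
def sparse_reward(position, goal):
    return 10.0 if position == goal else 0.0

# Dense (shaped): a small reward each step the agent moves closer.
def dense_reward(prev_position, position, goal):
    prev_dist = abs(goal - prev_position)
    dist = abs(goal - position)
    return 1.0 if dist < prev_dist else 0.0
```

With the sparse version, a random early agent almost never sees a nonzero signal and learns slowly; the dense version emits a learning signal on nearly every step.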
This notion of the “right behavior” leads me to another point. The people creating RL agents have an idea of what they would like the agents to do. They attempt to express their wishes in mathematical terms with a reward function. Sometimes, the reward function does not incentivize agents in the way the programmer wants. (A similar situation is found in software programming more generally: the computer will do precisely what you tell it to, but not necessarily what you want. That is, there is often a difference between what we want to say and what actually comes out of our mouths; between what we hope the computer will do and what we told it to do with code; between what we want the RL agent to achieve and what our rewards will lead it to do.) In other words, when great precision is required, it is easy to give an RL agent perverse incentives by accident. This notion of perverse incentives, familiar to anyone working in a large institution, will hopefully serve as a useful analogy for the ways human programmers fail to properly reward RL agents via the objective functions they write. Nonetheless, even if a reward function is poorly written, the agent strives to optimize for those rewards.
I’m not sure what you mean by “objective target”, but I’ll assume here that the objective function/reward function is the explicit definition of the objective target. If an RL agent does not achieve the objective target, there are a few ways to troubleshoot the agent’s poor performance.
1) Maybe you wrote the objective function wrong; that is, maybe you were rewarding behavior that will not lead the RL agent to succeed in the terms you imagined. Naive example: You want an RL agent to make its way through a difficult maze. Your reward function linearly allocates rewards to the agent the closer it comes to the exit to the maze (eg 1 step closer, 1 more point). The maze includes several dead ends that terminate one inch from the exit. The agent learns to take the turns that lead it to the end of those impasses. Solution: increase rewards exponentially as the agent nears the exit, with an additional dollop of reward when it exits. In this way, you retain a dense reward function that allows agents to learn even during failed runs, and you make sure the agent still recognizes that exiting the maze is more important than merely coming close to the exit.
2) Maybe the environment itself is too difficult or complex for an agent to learn to reach its goal within the constraints of your compute (ie the agent’s training time and simulation budget). Naive example: There is one exit to your maze, and 2,000,000 decisions to make, 99% of which end with an impasse. The agent never finds the needle in the haystack. In this case, you might try to vastly increase your compute and simulation runs. Or you might start training your agent on simple mazes, and use that trained model to jumpstart an agent that has to solve a more complex maze (one form of so-called curriculum learning).
3) Maybe you configured the agent wrong. That is, some problems are better cast as multi-agent problems than single-agent problems (think: coordinating the actions of a team on a field). Some problems require that the action space be defined as tuples, where a single agent takes more than one action at once, just as you might press more than one button on a video game console at the same time to execute a complex move.
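The fix described in (1) can be sketched concretely. The exact functional forms below are illustrative assumptions, not a recipe:

```python
# The broken objective: linear proximity reward, "1 step closer, 1 more point".
# A dead end one step from the exit earns almost as much as exiting.
def linear_reward(dist, max_dist=100):
    return max_dist - dist

# The fix: reward grows exponentially as the agent nears the exit,
# plus an additional dollop of reward for actually exiting. The
# reward function stays dense, but near-misses no longer dominate.
def shaped_reward(dist, exited=False):
    r = 2.0 ** (-dist)      # grows sharply only very close to the exit
    if exited:
        r += 100.0          # bonus for leaving the maze
    return r

near_miss = shaped_reward(1)           # 0.5
success = shaped_reward(0, exited=True)  # 101.0
```

Under the linear scheme the near-miss dead end scores 99 out of a possible 100; under the shaped scheme, exiting is worth roughly 200x a near miss.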
> “Importantly, reward does not automatically spawn thoughts _about_ reward, and reinforce those reward-focused thoughts!”
Standard RL agents do one thing: they attempt to maximize reward. They do not think about reward beyond that, and they also do not think about anything other than that.
There is a branch of RL called meta-learning where agents could arguably be said to “think about reward”, or at least to “think about learning faster, and exploring an unknown space to see which tasks and rewards are available.” Meta-learning is being actively researched at DeepMind, Google Brain, and Stanford. Here are some places to start reading about it, although I highly recommend working through Sutton and Barto’s book first.
[Open-ended play](https://www.deepmind.com/blog/generally-capable-agents-emerge-from-open-ended-play)
[Task inference](https://www.deepmind.com/publications/meta-reinforcement-learning-as-task-inference)
[Chelsea Finn’s work](https://ai.stanford.edu/~cbfinn/)
Standard RL agents take their reward function as a given. Including wireheading in an agent’s action space is a fundamentally different discussion that doesn’t apply to the vast majority of RL agents now. Mixing these two types of agents is not helpful to attaining clarity here.
For a standard RL agent, what constitutes reward is predefined by the human programmer. The RL agent will discover the state-action pairs that lead to maximum reward over the course of its learning. Even the human programmer does not know the best pathways through the environment; the programmers use the agent’s runs as a method of discovery, a search function, to surface new paths to a goal they have in mind.
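A tabular Q-learning toy makes this concrete: the programmer fixes the reward function in advance, and the agent's runs surface which state-action pairs earn reward. The corridor environment and hyperparameters below are illustrative assumptions:

```python
import random

random.seed(0)

# Tabular Q-learning on a 1-D corridor: reach cell 5 for reward 10.
# The programmer writes the reward rule; the agent discovers, through
# trial runs, the state-action pairs that lead to maximum reward.
GOAL, ACTIONS = 5, (-1, +1)
Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.2   # learning rate, discount, exploration

for episode in range(200):
    s = 0
    while s != GOAL:
        # epsilon-greedy: mostly exploit current Q, sometimes explore
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda b: Q[(s, b)])
        s2 = max(0, min(GOAL, s + a))
        r = 10.0 if s2 == GOAL else 0.0      # the fixed reward function
        best_next = max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

# After training, the greedy policy heads right, toward the goal.
policy = {s: max(ACTIONS, key=lambda b: Q[(s, b)]) for s in range(GOAL)}
```

Nothing in the loop lets the agent inspect or alter the reward rule; it only samples the rule's outputs and reshapes its value estimates accordingly.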
This is also an important distinction vis-a-vis the utility function you mention. As I understand utility, at least in economics, it is often revealed by human behavior, e.g. by people’s choices and the prices they are willing to pay for experiences. That’s not the case with standard RL agents. We know their reward functions, because we wrote them. All they reveal are new methods to achieve the things that we programmed them to value.
There is no moment in a standard RL agent’s computational life when it reaches the end of a maze and asks itself: “what else might I enjoy doing besides solving mazes?” It does not generalize. It does not rewrite its reward function. That’s not included in the action space of these agents.
> by definition “reward _must be_ what is optimized for by RL agents.”

This is not true, and the essay is meant to explain why. In vanilla policy gradient, reward R on a trajectory τ provides a set of gradients which push up the logits on the actions a_t that produced the trajectory. The gradient on the parameters θ which parameterize the policy π_θ is in the direction of increasing return J:
∇_θ J(π_θ) = E_{τ∼π_θ} [ ∑_{t=0}^{T} ∇_θ log π_θ(a_t ∣ s_t) R(τ) ]

You can read more about this here.
Less formally, the agent does stuff. Some stuff is rewarding. Rewarding actions get upweighted locally. That’s it. There’s no math here that says “and the agent shall optimize for reward explicitly”; the math actually says “the agent’s parameterization is locally optimized by reward on the data distribution of the observations it actually makes.” Reward simply chisels cognition into agents (at least, in PG-style setups).
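That "rewarded actions get upweighted locally" dynamic can be shown in a toy numpy sketch of the policy-gradient update on a two-armed bandit. This is a drastic simplification, and every name and constant here is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

# One-step, two-action bandit: action 1 pays 1.0, action 0 pays 0.0.
# The policy is a softmax over two logits (theta).
theta = np.zeros(2)
lr = 0.5

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(100):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    R = 1.0 if a == 1 else 0.0
    # grad of log pi(a) w.r.t. theta is one_hot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    # The whole update: rewarded actions get their logits pushed up.
    theta += lr * R * grad_log_pi
```

The logit for the rewarded action rises and the policy concentrates on it, yet no step of the update ever represents "reward" as an object to be pursued; reward only appears as a multiplier that chisels the parameters.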
In some settings, convergence results guarantee that this process converges to an optimal policy. As explained in the section “When is reward the optimization target of the agent?”, these settings probably don’t bear on smart alignment-relevant agents operating in reality.