This post seems basically correct to me, but I do want to caveat this bit:
What actually happens is this:
1. The model takes a series of actions (which we collect across multiple “episodes”).
2. After collecting these episodes, we determine how good the actions in each episode are using a reward function.
3. We use gradient descent to alter the parameters of the model so the good actions will be more likely and the bad actions will be less likely when we next collect some episodes.
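For concreteness, those three steps might look like the following minimal REINFORCE-style sketch (toy code; `policy`, `env`, and `reward_fn` are hypothetical stand-ins, with `env` assumed to follow the classic Gym reset/step interface):

```python
import torch

def collect_episode(env, policy):
    """Step 1: the model only observes states and takes actions."""
    obs, done = env.reset(), False
    episode, log_probs = [], []
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        episode.append((obs, action.item()))
        obs, _, done, _ = env.step(action.item())
    return episode, torch.stack(log_probs)

def update(policy, optimizer, episodes, reward_fn):
    """Steps 2 and 3: score each episode with the reward function, then nudge
    the parameters so highly-rewarded action sequences become more likely."""
    loss = torch.tensor(0.0)
    for episode, log_probs in episodes:
        R = reward_fn(episode)               # step 2: reward computed outside the policy
        loss = loss - R * log_probs.sum()    # step 3: policy-gradient update direction
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```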
Directly using the episodic rewards to do gradient descent on the policy parameters is one class of policy optimization approaches (see vanilla policy gradient, REINFORCE). Some of the other popular RL methods add an additional component beyond the policy—a baseline or learned value function—which may “see” the rewards directly upon receipt and which is used in combination with the reward function to determine the policy gradients (see REINFORCE with a baseline, actor-critic methods, PPO). In value-based methods, the value function is directly updated to become a more consistent predictor of reward (see value iteration, Q-learning). More complex methods that I probably wouldn’t call vanilla RL can use a model to do planning, in which case the agent really does “see” the reward by way of the model and imagined rollouts.
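For reference, the first two cases differ only in what multiplies the score function in the standard policy-gradient estimator; the baseline/value function is itself trained by regressing directly on observed returns:

```latex
% Vanilla policy gradient / REINFORCE: the raw return G_t multiplies the score function
\nabla_\theta J(\theta) \;\approx\; \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t

% With a learned baseline / critic: the advantage (G_t - V_\phi(s_t)) is used instead,
% and V_\phi is fit to the returns by ordinary regression
\nabla_\theta J(\theta) \;\approx\; \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\bigl(G_t - V_\phi(s_t)\bigr),
\qquad
\min_\phi \sum_t \bigl(G_t - V_\phi(s_t)\bigr)^2
```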
I agree with the sentiment of this. When writing the post, I was aware that this description would not apply to all cases of RL.
However, I think I disagree with respect to non-vanilla policy gradient methods (TRPO, PPO, etc.). Using advantage functions and baselines doesn’t change how things look from the perspective of the policy network: it still only observes the environment and takes appropriate actions, and never “sees” the reward. Any advantage functions are only used in step 2 of my example, not step 1. (I’m sure there are schemes where this is not the case, but I think I’m correct for PPO.)
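Concretely, in a typical PPO-style implementation (a toy sketch of the clipped objective, not any particular library’s code), the advantage enters only when we compute the update; the policy network’s forward pass consumes observations alone:

```python
import torch

def ppo_clipped_loss(policy, obs, actions, old_log_probs, advantages, clip_eps=0.2):
    """Toy PPO clipped objective. Note what the policy network actually sees:
    only `obs`. Rewards appear solely via the precomputed `advantages`, which
    shape the gradient but are never an input to the network."""
    dist = torch.distributions.Categorical(logits=policy(obs))
    log_probs = dist.log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```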
I’m less sure of this, but even in model-based systems, Q-learning, etc., the planning/iteration happens with respect to the outputs of a value network, which is trained to be correlated with the reward but isn’t the reward itself. For example, I would say that the MCTS procedure of MuZero does “want” something, but what it wants is not plans that get high reward; it is plans that score highly according to the system’s value function. (I’m happy to be deferential on this though.)
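As an illustration of that shape (a deliberately generic brute-force planner, not MuZero’s actual MCTS), the search below never touches the environment’s reward; it ranks imagined rollouts purely by the learned value head:

```python
import itertools

def plan(dynamics_model, value_net, latent_state, action_space, horizon=3):
    """Toy planner: roll each candidate action sequence forward with a learned
    dynamics model and keep the one the learned value function scores highest."""
    best_score, best_plan = float("-inf"), None
    for candidate in itertools.product(action_space, repeat=horizon):
        s = latent_state
        for a in candidate:
            s = dynamics_model(s, a)     # imagined rollout, no real environment
        score = float(value_net(s))      # judged by the value head, not by reward
        if score > best_score:
            best_score, best_plan = score, candidate
    return best_plan
```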
The other interesting case is Decision Transformers. DTs absolutely “get” reward. It is explicitly an input to the model! But I mentally bucket them as generative models as opposed to systems that “want” reward.
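The “reward as input” point is visible in how a Decision Transformer’s context is assembled (a toy illustration with made-up token names, not the actual implementation):

```python
def dt_context(returns_to_go, states, actions):
    """Toy Decision-Transformer-style context: the desired remaining reward is an
    explicit token interleaved with states and actions, and the model is simply
    trained to predict the next action given that sequence."""
    tokens = []
    for rtg, s, a in zip(returns_to_go, states, actions):
        tokens += [("return_to_go", rtg), ("state", s), ("action", a)]
    return tokens
```

At inference time the first return-to-go entry is just whatever return you would like the generated trajectory to achieve, which is why this reads more like conditioning a generative model than like optimizing for reward.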
Given that weight sharing between actor and critic networks is a common practice, and that the critic passes gradients (learned from the reward) into parameters the actor also uses, for most practical purposes the actor gets all of the information it needs about the reward.

This is the case for many common architectures.
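For example, a common shared-trunk layout looks roughly like this (toy sketch); the value head’s loss is supervised by returns, and its gradients flow back through the trunk that the policy head also reads from:

```python
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """Toy shared-trunk actor-critic: the critic's (reward-supervised) gradients
    backpropagate into `body`, which the actor's policy head also uses."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.policy_head = nn.Linear(hidden, n_actions)   # actor
        self.value_head = nn.Linear(hidden, 1)            # critic

    def forward(self, obs):
        features = self.body(obs)
        return self.policy_head(features), self.value_head(features)
```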
Yeah. For non-vanilla PG methods, I didn’t mean to imply that the policy “sees” the rewards in step 1. I meant that a part of the agent (its value function) “sees” the rewards in the sense that those are direct supervision signals used to train it in step 2, where we’re determining the direction and strength of the policy update.
And yeah, the model-based case is weirder. I can’t recall whether or not predicted rewards (from the dynamics model, not from the value function) are a part of the upper confidence bound score in MuZero. If they are, I’d think it’s fair to say that the overall policy (i.e. not just the policy network, but the policy w/ MCTS) “wants” reward.