OP says that this post focuses on policy-gradient RL algorithms (e.g. PPO), where the reward signal is used by gradient descent to update the policy.
But what about Q-learning, another popular family of RL algorithms? My understanding of Q-learning is that a Q-network takes an observation as input, estimates the value (expected return) of each possible action in that state, Q(s, a_i), and then the policy picks the action with the highest estimated value.
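For concreteness, something like this is what I have in mind. It's only a minimal sketch of greedy (epsilon-greedy) action selection from a Q-function; the names (`q_network`, `select_action`) are hypothetical and not taken from the post:

```python
import numpy as np

# Minimal sketch (hypothetical names): greedy action selection from a
# learned Q-function. `q_network` is assumed to map an observation to a
# vector of Q(s, a_i) estimates, one entry per discrete action.

def select_action(q_network, observation, epsilon=0.0, rng=None):
    """Pick the action with the highest estimated Q-value (epsilon-greedy)."""
    rng = rng or np.random.default_rng()
    q_values = q_network(observation)       # shape: (num_actions,)
    if rng.random() < epsilon:               # explore with probability epsilon
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))          # exploit: argmax_a Q(s, a)

# Toy usage: a stand-in "Q-network" that is just a fixed linear map.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))              # 3 actions, 4-dim observations
action = select_action(lambda obs: W @ obs, np.ones(4))
```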
Does this mean that reward is not the optimization target for policy-gradient algorithms, but is for Q-learning algorithms?