OP says that this post focuses on policy-gradient RL algorithms (e.g. PPO), where the reward signal is used by gradient descent to update the policy.
But what about Q-learning, another popular family of RL algorithms? My understanding of Q-learning is that a Q-network takes an observation as input, estimates the value (expected return) of each possible action in that state, Q(s, a_i), and then the policy picks the action with the highest estimated value.
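For concreteness, something like this is what I have in mind. It's only a minimal sketch of greedy (epsilon-greedy) action selection from a Q-function; the names (`q_network`, `select_action`) are hypothetical and not taken from the post:

```python
import numpy as np

# Minimal sketch (hypothetical names): greedy action selection from a
# learned Q-function. `q_network` is assumed to map an observation to a
# vector of Q(s, a_i) estimates, one entry per discrete action.

def select_action(q_network, observation, epsilon=0.0, rng=None):
    """Pick the action with the highest estimated Q-value (epsilon-greedy)."""
    rng = rng or np.random.default_rng()
    q_values = q_network(observation)       # shape: (num_actions,)
    if rng.random() < epsilon:               # explore with probability epsilon
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))          # exploit: argmax_a Q(s, a)

# Toy usage: a stand-in "Q-network" that is just a fixed linear map.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))              # 3 actions, 4-dim observations
action = select_action(lambda obs: W @ obs, np.ones(4))
```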
Does this mean that reward is not the optimization target for policy-gradient algorithms, but is for Q-learning algorithms?