Yeah. For non-vanilla PG methods, I didn’t mean to imply that the policy “sees” the rewards in step 1. I meant that a part of the agent (its value function) “sees” the rewards in the sense that those are direct supervision signals used to train it in step 2, where we’re determining the direction and strength of the policy update.
And yeah, the model-based case is weirder. I can’t recall whether or not predicted rewards (from the dynamics model, not from the value function) are a part of the upper confidence bound score in MuZero. If they are, I’d think it’s fair to say that the overall policy (i.e. not just the policy network, but the policy w/ MCTS) “wants” reward.
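To make the first point concrete, here’s a minimal actor-critic sketch (a toy one-state, two-action setup I made up for illustration; all names like `theta` and `v` are hypothetical, not from any library). Note that the raw reward `r` appears only in the critic’s TD target, while the policy update’s direction and strength come from the critic’s error, not from `r` directly:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.zeros(2)          # policy logits (one state, two actions)
v = 0.0                      # critic's value estimate for that state
alpha_pi, alpha_v = 0.1, 0.5
true_reward = np.array([0.0, 1.0])  # action 1 is better

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(500):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    r = true_reward[a]

    # Step 2: the VALUE function "sees" the reward directly --
    # r is its supervision signal (a one-step, terminal TD target).
    td_error = r - v
    v += alpha_v * td_error

    # Step 1: the POLICY never touches r itself; the update's
    # direction and strength come from the critic's error.
    grad_logpi = np.eye(2)[a] - probs
    theta += alpha_pi * td_error * grad_logpi

probs = softmax(theta)
```

After training, `probs[1]` should dominate: the policy learned to prefer the rewarding action even though the reward only ever flowed through the critic.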