The claim is that in the standard RL paradigm, we never look at the full trajectory before providing feedback, in either myopic or nonmyopic training.
I mean, this is true in the sense that the Gym interface returns a reward with every transition, but the vast majority of deep RL algorithms don’t do anything with those rewards until the trajectory is done (or, for very long trajectories, until you’ve collected a lot of experience from that trajectory). So you could just as easily evaluate the rewards then, and the algorithms wouldn’t change at all (though their implementations would).
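To make that concrete, here is a minimal sketch (assuming the Gymnasium API, CartPole-v1 as a stand-in environment, and a random policy in place of a learned one): the environment hands back a reward on every `env.step`, but nothing is computed from those rewards until the episode is over, at which point they are turned into returns for an update. Evaluating the rewards only at that point would leave the algorithm itself unchanged.

```python
import gymnasium as gym
import numpy as np

# Sketch: rewards arrive with every transition (the Gym interface),
# but nothing consumes them until the trajectory is done.

env = gym.make("CartPole-v1")
gamma = 0.99

obs, _ = env.reset(seed=0)
states, actions, rewards = [], [], []
done = False
while not done:
    action = env.action_space.sample()  # stand-in for a learned policy
    next_obs, reward, terminated, truncated, _ = env.step(action)
    # The reward is recorded here, but not *used* here.
    states.append(obs)
    actions.append(action)
    rewards.append(reward)
    obs = next_obs
    done = terminated or truncated

# Only now, with the full trajectory in hand, are the rewards touched:
# e.g. turned into discounted returns for a policy-gradient-style update.
returns = np.zeros(len(rewards))
running = 0.0
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    returns[t] = running

print(f"episode length: {len(rewards)}, return at t=0: {returns[0]:.2f}")
```

The rewards could just as well have been computed after the rollout (from the stored states and actions) rather than returned step by step; the return calculation and the update that follows it would not notice the difference.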