To me the concept of a horizon (or a discount factor) as a hyperparameter no longer makes sense when there’s only a single reward at the end of the episode, as in the paper here or in AlphaGo/Zero. They only make sense with intermediate reward, as in Atari or my proposal here.
With only final rewards, you can still include it as a variable formally, but there's no reason to make that variable anything less than the episode length. (If the horizon is n steps shorter than the episode length, this just sets the value function identically to 0 for the first n steps.)
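To make that parenthetical concrete, here's a minimal sketch (my own illustration, with a made-up episode length and reward): a single terminal reward, and a hard undiscounted horizon h under which the value at step t is the sum of the next h rewards.

```python
# A toy episode of length 10 with a single reward of 1.0 at the final step.
ep_len = 10
rewards = [0.0] * (ep_len - 1) + [1.0]

def value(t, h, rewards):
    """Finite-horizon (undiscounted) value: sum of the next h rewards."""
    return sum(rewards[t : t + h])

# Horizon = episode length: every step's value already equals the final reward.
print([value(t, ep_len, rewards) for t in range(ep_len)])

# Horizon n = 3 steps shorter than the episode: the value function is
# identically 0 for the first n steps, and unchanged after that.
n = 3
print([value(t, ep_len - n, rewards) for t in range(ep_len)])
```

So shrinking the horizon below the episode length only blinds the early steps to the one reward that exists; it never reshapes credit assignment the way it does with intermediate rewards.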
I guess I was using “there isn’t a horizon per se” to mean “the time structure of the rewards determines the horizon for you, it wouldn’t make sense to vary it,” but I can see how that would be confusing.
If you set the horizon to 1 but changed nothing else in their work, you'd get a dumb policy that equals the initial LM until the very last token, which it would treat as responsible for the entire reward. If you add intermediate rewards and set the horizon to 1, you get something more sensible.
Ah got it, that makes sense, I agree with all of that.