where there isn’t a “horizon” per se because all episodes have a fixed duration and receive rewards only at the end.
I’m confused how this is not a horizon? Perhaps we’re using words differently—I’m saying “there’s a hyperparameter that controls the number of timesteps over which credit assignment must be performed; in their setting it’s the sentence length and in your setting it is 1; nothing else would need to change”.
To me the concept of a horizon (or a discount factor) as a hyperparameter no longer makes sense when there’s only a single reward at the end of the episode, as in the paper here or in AlphaGo/Zero. Those concepts only make sense with intermediate rewards, as in Atari or my proposal here.
With only final rewards, you can still include the horizon as a variable formally, but there’s no reason to make that variable anything less than the episode length. (If the horizon is n steps lower than the episode length, this just sets the value function identically to 0 for the first n steps.)
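To spell out that parenthetical (a minimal sketch; the notation of episode length T, horizon h, and a single final reward R is introduced here rather than taken from the paper):

$$
V_h(s_t) \;=\; \mathbb{E}\!\left[\sum_{k=0}^{h-1} r_{t+k}\right],
\qquad
r_t \;=\; \begin{cases} R & t = T \\ 0 & t < T \end{cases}
$$

So $V_h(s_t) = 0$ exactly when $t + h - 1 < T$; with $h = T - n$, that is the first $n$ steps ($t \le n$), and from step $n+1$ onward the horizon reaches the final reward.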
That all makes sense, except for this part: “there isn’t a ‘horizon’ per se.”
I guess I was using “there isn’t a horizon per se” to mean “the time structure of the rewards determines the horizon for you; it wouldn’t make sense to vary it,” but I can see how that would be confusing.
If you set the horizon to 1 but changed nothing else in their work, you’d get a dumb policy that equals the initial LM until the very last token, which it treats as responsible for the entire reward. If you add intermediate rewards and set the horizon to 1, you get something more sensible.
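A toy sketch of that last point, with a made-up helper and made-up per-token rewards (nothing here is from the paper): with a horizon of 1, the only signal each token’s action sees is its immediate reward.

```python
def horizon_returns(rewards, horizon):
    """For each timestep t, sum the rewards over the window [t, t + horizon)."""
    T = len(rewards)
    return [sum(rewards[t:t + horizon]) for t in range(T)]

# Setting 1: a single reward of 1.0 at the end of a 5-token episode (final-only reward).
final_only = [0.0, 0.0, 0.0, 0.0, 1.0]
print(horizon_returns(final_only, horizon=1))  # [0.0, 0.0, 0.0, 0.0, 1.0]
# Every token except the last sees a return of 0, so nothing pushes the policy
# away from the initial LM until the very last token.

# Setting 2: hypothetical intermediate (per-token) rewards.
per_token = [0.2, -0.1, 0.3, 0.1, 0.5]
print(horizon_returns(per_token, horizon=1))   # [0.2, -0.1, 0.3, 0.1, 0.5]
# Now every token gets its own signal, so a horizon of 1 is meaningful.

# For contrast, the full-episode horizon in the final-only setting:
print(horizon_returns(final_only, horizon=5))  # [1.0, 1.0, 1.0, 1.0, 1.0]
```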
Ah got it, that makes sense, I agree with all of that.