To me the concept of a horizon (or a discount factor) as a hyperparameter no longer makes sense when there’s only a single reward at the end of the episode, as in the paper here or in AlphaGo/Zero. They only make sense with intermediate reward, as in Atari or my proposal here.
With only final rewards, you can still include it as a variable formally, but there's no reason to make that variable anything less than the episode length. (If the horizon is n steps shorter than the episode length, this just sets the value function identically to 0 for the first n steps.)
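To make that parenthetical concrete, here's a minimal sketch (my own illustration, with a made-up episode length and reward): a single terminal reward, and a hard undiscounted horizon h under which the value at step t is the sum of the next h rewards.

```python
# A toy episode of length 10 with a single reward of 1.0 at the final step.
ep_len = 10
rewards = [0.0] * (ep_len - 1) + [1.0]

def value(t, h, rewards):
    """Finite-horizon (undiscounted) value: sum of the next h rewards."""
    return sum(rewards[t : t + h])

# Horizon = episode length: every step's value already equals the final reward.
print([value(t, ep_len, rewards) for t in range(ep_len)])

# Horizon n = 3 steps shorter than the episode: the value function is
# identically 0 for the first n steps, and unchanged after that.
n = 3
print([value(t, ep_len - n, rewards) for t in range(ep_len)])
```

So shrinking the horizon below the episode length only blinds the early steps to the one reward that exists; it never reshapes credit assignment the way it does with intermediate rewards.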
I guess I was using “there isn’t a horizon per se” to mean “the time structure of the rewards determines the horizon for you, it wouldn’t make sense to vary it,” but I can see how that would be confusing.
If you set the horizon to 1 but changed nothing else in their work, you'd get a dumb policy that equals the initial LM until the very last token, which it would treat as responsible for the entire reward. If you add intermediate rewards and set the horizon to 1, you get something more sensible.
Ah got it, that makes sense, I agree with all of that.