Accept no substitutes! Gradient ascent directly on the differentiable reward/environment model!
This idea has come up at CHAI occasionally, but I don’t think anyone has actually run with it—do you know any examples of work that does this from (possibly simulated) human feedback? I’m pretty curious to see how much white-box optimization helps.
No, not yet. (IMO, the power of differentiability is greatly underused. Everyone is locked into an 'optimize parameters based on data & loss' mindset, and few ever use the alternatives like 'optimize data/trajectory based on parameters & loss' or 'optimize loss based on data/parameters'.)
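A minimal sketch of what that could look like, assuming a learned, frozen, differentiable reward model over trajectories (all shapes, names, and the toy MLP here are illustrative assumptions, not from the thread): hold the model's parameters fixed and run gradient ascent directly on the action sequence, i.e. 'optimize data/trajectory based on parameters & loss' instead of the usual direction.

```python
import torch

torch.manual_seed(0)

T, act_dim = 20, 4  # horizon and action dimension (made up for illustration)

# Stand-in for a learned differentiable reward model: flattened trajectory -> scalar reward.
reward_model = torch.nn.Sequential(
    torch.nn.Linear(T * act_dim, 64),
    torch.nn.Tanh(),
    torch.nn.Linear(64, 1),
)
for p in reward_model.parameters():
    p.requires_grad_(False)  # parameters are frozen; only the input is optimized

# The trajectory itself is the optimization variable.
actions = torch.zeros(T, act_dim, requires_grad=True)
opt = torch.optim.Adam([actions], lr=1e-2)

for step in range(500):
    opt.zero_grad()
    reward = reward_model(actions.flatten())
    (-reward).backward()  # ascend the predicted reward by descending its negative
    opt.step()

print("final predicted reward:", reward_model(actions.flatten()).item())
```

Same autodiff machinery, just pointed at the trajectory instead of the weights; the open question upthread is how much this white-box optimization buys you when the reward model comes from (possibly simulated) human feedback.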
Strongly agree. It’s obnoxiously difficult to get people to understand that this was what I did (kind of) in this paper.