Accept no substitutes! Gradient ascent directly on the differentiable reward/environment model!
This idea has come up at CHAI occasionally, but I don’t think anyone has actually run with it—do you know any examples of work that does this from (possibly simulated) human feedback? I’m pretty curious to see how much white-box optimization helps.
No, not yet. (IMO, the power of differentiability is greatly underused. Everyone is locked into an 'optimize parameters based on data & loss' mindset, and few ever use the alternatives like 'optimize data/trajectory based on parameters & loss' or 'optimize loss based on data/parameters'.)
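A minimal sketch of what that could look like, assuming a learned, frozen, differentiable reward model over trajectories (all shapes, names, and the toy MLP here are illustrative assumptions, not from the thread): hold the model's parameters fixed and run gradient ascent directly on the action sequence, i.e. 'optimize data/trajectory based on parameters & loss' instead of the usual direction.

```python
import torch

torch.manual_seed(0)

T, act_dim = 20, 4  # horizon and action dimension (made up for illustration)

# Stand-in for a learned differentiable reward model: flattened trajectory -> scalar reward.
reward_model = torch.nn.Sequential(
    torch.nn.Linear(T * act_dim, 64),
    torch.nn.Tanh(),
    torch.nn.Linear(64, 1),
)
for p in reward_model.parameters():
    p.requires_grad_(False)  # parameters are frozen; only the input is optimized

# The trajectory itself is the optimization variable.
actions = torch.zeros(T, act_dim, requires_grad=True)
opt = torch.optim.Adam([actions], lr=1e-2)

for step in range(500):
    opt.zero_grad()
    reward = reward_model(actions.flatten())
    (-reward).backward()  # ascend the predicted reward by descending its negative
    opt.step()

print("final predicted reward:", reward_model(actions.flatten()).item())
```

Same autodiff machinery, just pointed at the trajectory instead of the weights; the open question upthread is how much this white-box optimization buys you when the reward model comes from (possibly simulated) human feedback.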
Strongly agree. It’s obnoxiously difficult to get people to understand that this was what I did (kind of) in this paper.