FWIW, Gwern reports trying OpenAI’s approach and finding the RL side specifically frustrating and unstable; this is pretty normal with RL, and compatible with the reward-model part being very successful in its own domain. It’s not clear whether OpenAI got the RL part to work well because they did something right, or because they have lots of resources and can keep trying over and over until it works.
At the time, I figured that it was probably a sample-efficiency problem: the reward model just wasn’t picking up on the subtle esthetics I wanted it to. I see this as supported by their new results: large models are more sample-efficient, so unsurprisingly, it works a lot better—the reward model can finally manage to understand what the preferences are, so it can provide a real signal to the RL training.
They seem to think it has more to do with label quality / better raters, which I didn’t think was my problem (who better than me to rate my preferred ABC samples?), but better label quality is sort of like better sample-efficiency; I haven’t read the paper in enough detail to see if they ablated model size vs label n vs label quality to get an idea of where the improvement is coming from.
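For concreteness, the reward model being discussed here is just a scorer trained on pairwise comparisons, Bradley-Terry style. A minimal PyTorch sketch of that objective; the architecture, shapes, and names are placeholders of mine, not anything from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps an encoded sample to a scalar score; higher = more preferred."""
    def __init__(self, feature_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

reward_model = RewardModel()
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# One fake batch of labeled comparisons: `preferred` was chosen over `rejected`.
preferred = torch.randn(32, 128)
rejected = torch.randn(32, 128)

# Bradley-Terry loss: maximize P(preferred beats rejected) = sigmoid(r_p - r_r).
loss = -F.logsigmoid(reward_model(preferred) - reward_model(rejected)).mean()
opt.zero_grad()
loss.backward()
opt.step()
```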
Again, wouldn’t it be nice if we could avoid the need for this thing and just train on the preferences directly?
Accept no substitutes! Gradient ascent directly on the differentiable reward/environment model!
This idea has come up at CHAI occasionally, but I don’t think anyone has actually run with it—do you know any examples of work that does this from (possibly simulated) human feedback? I’m pretty curious to see how much white-box optimization helps.
No, not yet. (IMO, the power of differentiability is greatly underused. Everyone is locked into an ‘optimize parameters based on data & loss’ mindset, and few ever use the alternatives like ‘optimize data/trajectory based on parameters & loss’ or ‘optimize loss based on data/parameters’.)
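To make the ‘optimize data based on parameters & loss’ flavor concrete: freeze the trained reward model and run gradient ascent on the candidate sample itself. A toy PyTorch sketch, with a placeholder network standing in for whatever frozen differentiable scorer you actually have:

```python
import torch
import torch.nn as nn

# Stand-in for a trained reward model; in practice, load yours and freeze it.
reward_model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 1))
for p in reward_model.parameters():
    p.requires_grad_(False)

# The *sample itself* is the thing being optimized, not the model's weights.
x = torch.randn(1, 128, requires_grad=True)
opt = torch.optim.Adam([x], lr=1e-2)

for _ in range(500):
    opt.zero_grad()
    score = reward_model(x).sum()
    (-score).backward()   # ascend on the reward model's score
    opt.step()
```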
Strongly agree. It’s obnoxiously difficult to get people to understand that this was what I did (kind of) in this paper.
Some new links on that topic: https://fraser-greenlee.github.io/2020/08/13/Transformers-as-Variational-Autoencoders.html https://fraser-greenlee.github.io/2020/08/25/Transformer-VAE-for-Program-Synthesis.html
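Those are Transformer-VAE posts, which point at the latent-space version of the same trick: optimize a latent code, decode it, and ascend on the frozen reward model’s score through both networks. A toy sketch with placeholder networks, just to show the shape of the idea:

```python
import torch
import torch.nn as nn

# Toy decoder + reward model; both frozen, differentiable end to end.
decoder = nn.Sequential(nn.Linear(16, 256), nn.ReLU(), nn.Linear(256, 128))
reward_model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 1))
for net in (decoder, reward_model):
    for p in net.parameters():
        p.requires_grad_(False)

z = torch.zeros(1, 16, requires_grad=True)   # latent code being optimized
opt = torch.optim.Adam([z], lr=1e-2)

for _ in range(500):
    opt.zero_grad()
    sample = decoder(z)                       # stay on the decoder's learned manifold
    (-reward_model(sample).sum()).backward()  # ascend on reward through both networks
    opt.step()
```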