FWIW, Gwern reports trying OpenAI’s approach and finding the RL side specifically frustrating and unstable; this is pretty normal with RL, and compatible with the reward-model part being very successful in its own domain. It’s not clear whether OpenAI got the RL part to work well because they did something right, or because they have lots of resources and can keep trying over and over until it works.
At the time, I figured that it was probably a sample-efficiency problem: the reward model just wasn’t picking up on the subtle esthetics I wanted it to. I see this as supported by their new results: large models are more sample-efficient, so unsurprisingly, it works a lot better—the reward model can finally manage to understand what the preferences are, so it can provide a real signal to the RL training.
They seem to think it has more to do with label quality / better raters, which I didn’t think was my problem (who better than me to rate my preferred ABC samples?), but better label quality is sort of like better sample-efficiency; I haven’t read the paper in enough detail to see if they ablated model size vs label n vs label quality to get an idea of where the improvement is coming from.
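For concreteness, the reward model being discussed here is just a scorer trained on pairwise comparisons, Bradley-Terry style. A minimal PyTorch sketch of that objective; the architecture, shapes, and names are placeholders of mine, not anything from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps an encoded sample to a scalar score; higher = more preferred."""
    def __init__(self, feature_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

reward_model = RewardModel()
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# One fake batch of labeled comparisons: `preferred` was chosen over `rejected`.
preferred = torch.randn(32, 128)
rejected = torch.randn(32, 128)

# Bradley-Terry loss: maximize P(preferred beats rejected) = sigmoid(r_p - r_r).
loss = -F.logsigmoid(reward_model(preferred) - reward_model(rejected)).mean()
opt.zero_grad()
loss.backward()
opt.step()
```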
Again, wouldn’t it be nice if we could avoid the need for this thing and just train on the preferences directly?
Accept no substitutes! Gradient ascent directly on the differentiable reward/environment model!
This idea has come up at CHAI occasionally, but I don’t think anyone has actually run with it—do you know any examples of work that does this from (possibly simulated) human feedback? I’m pretty curious to see how much white-box optimization helps.
No, not yet. (IMO, the power of differentiability is greatly underused. Everyone is locked into an ‘optimize parameters based on data & loss’ mindset, and few ever use the alternatives like ‘optimize data/trajectory based on parameters & loss’ or ‘optimize loss based on data/parameters’.)
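To make the ‘optimize data based on parameters & loss’ flavor concrete: freeze the trained reward model and run gradient ascent on the candidate sample itself. A toy PyTorch sketch, with a placeholder network standing in for whatever frozen differentiable scorer you actually have:

```python
import torch
import torch.nn as nn

# Stand-in for a trained reward model; in practice, load yours and freeze it.
reward_model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 1))
for p in reward_model.parameters():
    p.requires_grad_(False)

# The *sample itself* is the thing being optimized, not the model's weights.
x = torch.randn(1, 128, requires_grad=True)
opt = torch.optim.Adam([x], lr=1e-2)

for _ in range(500):
    opt.zero_grad()
    score = reward_model(x).sum()
    (-score).backward()   # ascend on the reward model's score
    opt.step()
```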
Strongly agree. It’s obnoxiously difficult to get people to understand that this was what I did (kind of) in this paper.
Some new links on that topic: https://fraser-greenlee.github.io/2020/08/13/Transformers-as-Variational-Autoencoders.html https://fraser-greenlee.github.io/2020/08/25/Transformer-VAE-for-Program-Synthesis.html
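Those are Transformer-VAE posts, which point at the latent-space version of the same trick: optimize a latent code, decode it, and ascend on the frozen reward model’s score through both networks. A toy sketch with placeholder networks, just to show the shape of the idea:

```python
import torch
import torch.nn as nn

# Toy decoder + reward model; both frozen, differentiable end to end.
decoder = nn.Sequential(nn.Linear(16, 256), nn.ReLU(), nn.Linear(256, 128))
reward_model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 1))
for net in (decoder, reward_model):
    for p in net.parameters():
        p.requires_grad_(False)

z = torch.zeros(1, 16, requires_grad=True)   # latent code being optimized
opt = torch.optim.Adam([z], lr=1e-2)

for _ in range(500):
    opt.zero_grad()
    sample = decoder(z)                       # stay on the decoder's learned manifold
    (-reward_model(sample).sum()).backward()  # ascend on reward through both networks
    opt.step()
```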