I thought this series of comments from a former DeepMind employee (who worked on Gemini) was insightful, so I figured I should share it.
From my experience doing early RLHF work for Gemini, larger models exploit the reward model more. You need to constantly keep collecting more preferences and retraining the reward model to keep it from being exploited. Otherwise you get nonsensical responses that have exploited the idiosyncrasies of your preference data. There is a reason few labs have done RLHF successfully.
It’s also known that more capable models exploit loopholes in reward functions better. Imo, it’s a pretty intuitive idea that more capable RL agents will find larger rewards. But there’s evidence from papers like this as well: https://arxiv.org/abs/2201.03544
To be clear, I don’t think the current paradigm as-is is dangerous. I’m stating the obvious because this platform has gone a bit bonkers.
The danger comes from finetuning LLMs to become AutoGPTs which have memory, take actions, maximize rewards, and are deployed autonomously. Widespread proliferation of GPT-4+ models will almost certainly produce lots of these agents, which will cause a lot of damage and potentially something indistinguishable from extinction.
These agents will be very hard to align. Trading off their reward objective against your “be nice” objective won’t work. They will simply find the loopholes in your “be nice” objective and get that nice fat hard reward instead.
We’re currently at the extreme left of the AutoGPT exponential scaling curve (it basically doesn’t work now), so it’s hard to study whether more capable models are harder or easier to align.
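To make the reward-model-exploitation dynamic he describes a bit more concrete, here is a toy sketch of my own (not anything from Gemini or the linked paper; the linear proxy, the hand-built “gold” reward, and all constants are assumptions for illustration). A proxy reward model is fit on a narrow batch of noisy scores standing in for preference data, and an optimizer pushing against that proxy keeps driving the proxy score up while the hidden gold reward peaks and then collapses:

```python
# Toy sketch of reward-model overoptimization (illustrative only; the linear
# proxy, the hand-built "gold" reward, and all constants are assumptions).
import numpy as np

rng = np.random.default_rng(0)
DIM = 10

# Hidden "gold" reward: what we actually want. The quadratic penalty makes it
# impossible for a linear proxy, fit near the origin, to stay accurate far away.
w_gold = rng.normal(size=DIM)

def gold_reward(x):
    return x @ w_gold - 0.05 * np.sum(x**2, axis=-1)

# Proxy reward model: least-squares fit to noisy gold scores on a small,
# narrow sample standing in for preference data.
X_pref = rng.normal(size=(200, DIM))
y_pref = gold_reward(X_pref) + rng.normal(scale=0.1, size=200)
w_proxy, *_ = np.linalg.lstsq(X_pref, y_pref, rcond=None)

def proxy_reward(x):
    return x @ w_proxy

# Stand-in for policy optimization: gradient ascent on the proxy reward.
# More optimization pressure drifts further from the proxy's training data.
x = rng.normal(size=DIM)
for step in range(1, 401):
    x += 0.05 * w_proxy  # gradient of the linear proxy
    if step % 100 == 0:
        print(f"step {step:4d}  proxy={proxy_reward(x):8.2f}  "
              f"gold={gold_reward(x):8.2f}")
# The proxy score keeps climbing; the gold reward peaks and then falls once
# the optimizer leaves the region the proxy was trained on.
```

Collecting fresh preferences around the policy’s current outputs and refitting, as he describes, moves the region where the proxy is trustworthy; it doesn’t make the proxy globally correct, which is presumably why the collecting-and-retraining treadmill never stops.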
Other comments from that thread:
My guess is that your intuitive alignment strategy (“be nice”) breaks down for AI because, unlike humans, AI is highly mutable. It’s very hard to change a human’s sociopathy factor. But for AI, even if *you* did find a nice set of hyperparameters that trades off friendliness and goal-seeking behavior well, it’s very easy to take that and tune the knobs up to make something dangerous. Misusing the tech is as easy as, or easier than, not misusing it. This is why many put this in the same bucket as nuclear.
The US visits Afghanistan and teaches them how to make power using nuclear tech; next month, they have nukes pointing at Iran.
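His earlier point about trading a reward objective off against a “be nice” objective, and this point about how easy it is to retune the knobs, can both be shown with an almost embarrassingly small sketch (entirely made-up numbers, not a claim about any real system): the argmax lands on the option that games the niceness proxy, and the weight that is supposed to keep things safe is just a constant someone else can change.

```python
# Toy sketch: a scalarized objective  task_reward + LAMBDA * niceness_proxy
# can be gamed when the niceness term is only a proxy. All numbers invented.

# (description, task_reward, proxy_niceness, true_niceness)
candidates = [
    ("do the task politely",               5.0, 1.0, 1.0),
    ("do the task, cut corners",           8.0, 0.2, 0.2),
    ("exploit a loophole, *sound* polite", 9.5, 1.0, 0.0),  # proxy fooled
]

LAMBDA = 3.0  # the "be nice" knob

def combined(c, lam=LAMBDA):
    _, task, proxy_nice, _ = c
    return task + lam * proxy_nice

best = max(candidates, key=combined)
print("chosen:", best[0], "| true niceness:", best[3])
# The argmax is the loophole option: it matches the polite option on the
# proxy, beats it on task reward, and is not actually nice. Raising LAMBDA
# doesn't change the argmax, because the loophole already saturates the
# proxy; and turning LAMBDA down to 0 removes the constraint entirely, which
# is the "tune up the knobs" misuse worry.
```

The takeaway of the toy, for what it’s worth, is that the fix has to be a better proxy or a constraint the optimizer can’t satisfy vacuously, not a bigger weight on the same exploitable term.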
And, from the same thread:
In contexts where harms are easily visible and on short timelines, we’ll take them offline and retrain.
Many applications will be much more autonomous, difficult to monitor or even understand, and potentially fully closed-loop, i.e. the agent has a complex enough action space that it can copy itself, buy compute, run itself, etc.
I know it sounds scifi. But we’re living in scifi times. These things have a knack of becoming true sooner than we think.
No ghosts in the matrices assumed here. Just intelligence starting from a very good base model optimizing reward.
He made more comments in that thread that I found insightful, so go have a look if you’re interested.
“Larger models exploit the RM more” contradicts what I observed in the RM overoptimization paper. I’d be interested in more analysis of this.
In that paper, did you guys take a good long look at the outputs of the various-sized models throughout training, in addition to the graphs of gold-standard/proxy reward model ratings against KL divergence? If not, then maybe that’s the discrepancy: perhaps Sherjil was communicating with the LLM and thinking “this is not what we wanted”.
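For anyone who hasn’t seen that paper, the plots in question track a proxy (learned) reward and a gold-standard (trusted) reward as a function of the KL divergence between the tuned policy and the initial policy. Below is a minimal sketch of how such a curve can be produced, using exponential tilting of a toy categorical policy as a crude stand-in for RL fine-tuning pressure; everything here, including the planted “exploit”, is an assumption for illustration and not the paper’s actual setup.

```python
# Minimal sketch: proxy vs. gold reward as a function of KL from the initial
# policy, for a toy categorical "policy" over N canned responses. Purely
# illustrative; not the setup of the RM overoptimization paper.
import numpy as np

rng = np.random.default_rng(1)
N = 50
gold = rng.normal(size=N)                      # true quality of each response
proxy = gold + rng.normal(scale=0.5, size=N)   # noisy reward-model scores

# Plant one "exploit": a response the proxy badly overrates.
proxy[0] = proxy.max() + 2.0
gold[0] = gold.min() - 1.0

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

pi_ref = softmax(np.zeros(N))                  # uniform initial policy

# Tilt the policy toward the proxy with increasing strength beta, a crude
# stand-in for increasing RL optimization pressure.
for beta in [0.0, 0.25, 0.5, 1.0, 2.0, 4.0]:
    pi = softmax(beta * proxy)
    kl = float(np.sum(pi * np.log(pi / pi_ref)))
    print(f"beta={beta:5.2f}  KL={kl:6.2f}  "
          f"proxy={pi @ proxy:6.2f}  gold={pi @ gold:6.2f}")
# The proxy column rises monotonically with KL; the gold column comes apart
# from it as the policy concentrates on the overrated response.
```

Whether anyone also read the samples the policy concentrates on at the far right of such a curve is, I take it, the question being asked above.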