AI labs are starting to build AIs with capabilities that are hard for humans to oversee, such as answering questions based on large contexts (1M+ tokens), but they are still not deploying “scalable oversight” techniques such as IDA and Debate. (Gemini 1.5 report says RLHF was used.) Is this more good news or bad news?
Good: Perhaps RLHF is still working well enough, meaning that the resulting AI is following human preferences even out of the training distribution. In other words, they probably did RLHF on large contexts in narrow distributions, with human raters who already have prior knowledge of, or familiarity with, the whole context, since it would be too expensive to do RLHF with humans reading diverse 1M+ contexts from scratch, but the resulting chatbot is working well even outside the training distribution. (Is it actually working well? Can someone with access to Gemini 1.5 Pro please test this?)
Bad: AI developers haven’t taken alignment seriously enough to have invested enough in scalable oversight, and/or those techniques are unworkable or too costly, causing them to be unavailable.
From a previous comment:
From my experience doing early RLHF work for Gemini, larger models exploit the reward model more. You need to constantly keep collecting more preferences and retraining reward models to make it not exploitable. Otherwise you get nonsensical responses which have exploited the idiosyncrasies of your preference data. There is a reason few labs have done RLHF successfully.
This seems to be evidence that RLHF does not tend to generalize well out-of-distribution, causing me to update the above “good news” interpretation downward somewhat. I’m still very uncertain though. What do others think?
Apparently Gemini 1.5 Pro isn’t working great with large contexts:
While this worked well, for even a slightly more complicated problem the model failed. One Twitter user suggested just adding a random ‘iPhone 15’ in the book text and then asking the model if there is anything in the book that seems out of place. And the model failed to locate it.
The same was the case when the model was asked to summarize a 30-minute Mr. Beast video (over 300k tokens). It generated the summary but many people who had watched the video pointed out that the summary was mostly incorrect.
So while on paper this looked like a huge leap forward for Google, it seems that in practice it’s not performing as well as they might have hoped.
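For anyone with long-context access who wants to reproduce the “out of place” test described above, here is a minimal sketch. The `ask_model` call, the book file path, and the inserted sentence are placeholders for whatever API and text you actually use, not a real interface.

```python
import random

def build_probe(book_text: str, needle: str = "iPhone 15") -> str:
    """Insert an anachronistic sentence at a random paragraph boundary and
    return a prompt asking the model to spot it."""
    paragraphs = book_text.split("\n\n")
    pos = random.randrange(len(paragraphs))
    paragraphs.insert(pos, f"He glanced at his {needle} before continuing.")
    modified = "\n\n".join(paragraphs)
    return (
        modified
        + "\n\nQuestion: Is there anything in the book above that seems "
        + "out of place or anachronistic? If so, quote it."
    )

# Hypothetical usage -- `ask_model` is a stand-in for your long-context API call.
# with open("book.txt") as f:
#     prompt = build_probe(f.read())
# answer = ask_model(prompt)
# print("PASS" if "iPhone 15" in answer else "FAIL", answer[:200])
```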
But is this due to limitations of RLHF training, or something else?
RLHF with humans might also soon be made obsolete by things like online DPO, where another chatbot produces preference data for on-policy responses of the tuned model and there is no separate reward model in the RL sense. If generalization from labeling instructions through preference decisions works in practice, even the weak-to-strong setting won’t necessarily be important: tuning of a stronger model can get bootstrapped by a weaker model (where currently SFT from an obviously off-policy instruct dataset seems to suffice), and then the stronger model re-does the tuning of its equally strong successor that starts with the same base model (as in the self-rewarding paper), using some labeling instructions (“constitution”). So all that remains of human oversight that actually contributes to the outcome is the labeling instructions written in English, and possibly some feedback on them from spot-checking what happens as a result of choosing particular instructions.
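To make the shape of this concrete, here is a structural sketch of one such online loop. The names `policy_generate`, `judge_prefers_first`, and `dpo_update` are stubs I made up, and the “constitution” is just an example string, so this only illustrates the data flow, not the actual method from the self-rewarding paper.

```python
import random

# Structural sketch only: stubs stand in for the tuned model, the judge chatbot,
# and the DPO optimizer. Note there is no separate reward model anywhere.

def policy_generate(prompt: str, n: int = 2) -> list[str]:
    """On-policy: the model being tuned samples its own candidate responses."""
    return [f"candidate {i} for: {prompt}" for i in range(n)]

def judge_prefers_first(constitution: str, prompt: str, a: str, b: str) -> bool:
    """Another chatbot applies the written labeling instructions ('constitution')
    to decide which response is preferred. Stubbed with a coin flip here."""
    return random.random() < 0.5

def dpo_update(prompt: str, chosen: str, rejected: str) -> None:
    """Placeholder for one DPO gradient step on the (chosen, rejected) pair."""
    pass

CONSTITUTION = "Prefer responses that are honest, harmless, and follow the user's instructions."

def online_dpo_round(prompts: list[str]) -> None:
    for prompt in prompts:
        a, b = policy_generate(prompt, n=2)
        if judge_prefers_first(CONSTITUTION, prompt, a, b):
            dpo_update(prompt, chosen=a, rejected=b)
        else:
            dpo_update(prompt, chosen=b, rejected=a)

online_dpo_round(["Summarize this 700k-token transcript.", "Find the bug in this diff."])
```

In this setup the only human-authored input that shapes the outcome is the constitution string (plus any spot-checking of results), which is the point being made above.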
My guess is that we’re currently effectively depending on generalization. So “Good” from your decomposition. (Though I think depending on generalization will produce big issues if the model is scheming, so I would prefer avoiding this.)
since it would be too expensive to do RLHF with humans reading diverse 1M+ contexts from scratch
It’s plausible to me that after doing a bunch of RLHF on short contexts, RLHF on long contexts is extremely sample efficient (when well tuned) such that only (e.g.) 1,000s of samples suffice[1]. If you have a $2,000,000 budget for long context RLHF and need only 1,000 samples, you can spend $2,000 per sample. This gets you perhaps (e.g.) 10 hours of time of an experienced software engineer, which might suffice for good long context supervision without necessarily needing any fancy scalable oversight approaches. (That said, people will probably use another LLM by default when trying to determine the reward if they’re spending this long: recursive reward modeling seems almost certain by default if we’re assuming that people spend this much time labeling.)
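Spelling out the arithmetic (the budget and sample count are the illustrative figures above; the ~$200/hour engineer rate is an added assumption):

```python
# Back-of-the-envelope check (all inputs are illustrative assumptions, not real budgets).
budget_usd = 2_000_000            # hypothetical long-context RLHF labeling budget
num_samples = 1_000               # samples assumed sufficient after short-context RLHF
engineer_usd_per_hour = 200       # assumed fully loaded rate for an experienced engineer

usd_per_sample = budget_usd / num_samples                  # 2000.0
hours_per_sample = usd_per_sample / engineer_usd_per_hour  # 10.0
print(f"${usd_per_sample:.0f} per sample, about {hours_per_sample:.0f} engineer-hours each")
```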
That said, I doubt that anyone has actually started doing extremely high effort data labeling like this, though plausibly they should...
From a previous comment: [...] This seems to be evidence that RLHF does not tend to generalize well out-of-distribution
It’s some evidence, but exploiting a reward model seems somewhat orthogonal to generalization out of distribution: exploitation is heavily selected for.
(Separately, I expect that the quoted comment results in a misleadingly negative perception of the current situation.)
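One toy way to see how heavily exploitation is selected for: picking the best of n candidates by a noisy proxy reward preferentially picks the candidates the proxy overrates, so the gold score of the selection falls further behind the proxy score as n grows. A purely synthetic sketch (my own illustration, in the spirit of the overoptimization results discussed below, not a reproduction of them):

```python
import random

def sample_candidate() -> tuple[float, float]:
    """Return (gold, proxy): the true quality and the reward model's noisy estimate."""
    gold = random.gauss(0.0, 1.0)
    proxy = gold + random.gauss(0.0, 1.0)
    return gold, proxy

def best_of_n(n: int, trials: int = 2000) -> tuple[float, float]:
    """Average gold and proxy reward of the candidate selected by proxy reward."""
    gold_sum = proxy_sum = 0.0
    for _ in range(trials):
        gold, proxy = max((sample_candidate() for _ in range(n)), key=lambda gp: gp[1])
        gold_sum += gold
        proxy_sum += proxy
    return gold_sum / trials, proxy_sum / trials

for n in (1, 4, 16, 64, 256):
    gold, proxy = best_of_n(n)
    # The proxy keeps climbing while the gold reward lags: selection concentrates
    # on candidates the proxy overrates, i.e. exploitation is selected for.
    print(f"n={n:>3}  proxy={proxy:5.2f}  gold={gold:5.2f}  gap={proxy - gold:5.2f}")
```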
I think experiments on sample efficiency of RLHF when generalizing to a new domain could be very important and are surprisingly underdone from my perspective (at least I’m not aware of interesting results). Even more important is sample efficiency in cases where you have a massive number of weak labels, but a limited number of high quality labels. It seems plausible to me that the final RLHF approach used will look like training the reward model on a combination of 100,000s of weak labels and just 1,000 very high quality labels. (E.g. train a head on the weak labels and then train another head to predict the difference between the weak label and the strong label.) In this case, we could spend a huge amount of time on each label. E.g., with 100 skilled employees we could spend 5 days on each label and still be done in 50 days, which isn’t too bad of a delay. (If we’re fine with these labels trickling in for online training, the delay could be even smaller.)
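A minimal sketch of that two-head setup, assuming a PyTorch-style backbone that maps inputs to a pooled hidden vector; the class and training step below are my own illustration of the idea, not a known production recipe.

```python
import torch
import torch.nn as nn

class TwoHeadRewardModel(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_dim: int):
        super().__init__()
        self.backbone = backbone                         # e.g. an LM trunk producing a pooled hidden state
        self.weak_head = nn.Linear(hidden_dim, 1)        # trained on ~100,000s of weak labels
        self.correction_head = nn.Linear(hidden_dim, 1)  # trained on ~1,000 strong labels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.backbone(x)
        # Final reward estimate = weak-label head + learned correction toward the strong labels.
        return self.weak_head(h) + self.correction_head(h)

def correction_loss(model: TwoHeadRewardModel, x: torch.Tensor,
                    weak_label: torch.Tensor, strong_label: torch.Tensor) -> torch.Tensor:
    """Step 2: with the weak head already fit, train the correction head to
    predict the gap between the strong label and the weak label."""
    h = model.backbone(x)
    correction = model.correction_head(h).squeeze(-1)
    return nn.functional.mse_loss(correction, strong_label - weak_label)
```

Keeping the correction head separate means the scarce strong labels only have to model the gap between weak and strong judgments rather than the full reward.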
Thanks for some interesting points. Can you expand on “Separately, I expect that the quoted comment results in a misleadingly negative perception of the current situation.”? Also, your footnote seems incomplete? (It ends with “we could spend” on my browser.)
Can you expand on “Separately, I expect that the quoted comment results in a misleadingly negative perception of the current situation.”?
I’m skeptical that increased scale makes hacking the reward model worse. Of course, it could (and likely will/does) make hacking human labelers more of a problem, but this isn’t what the comment appears to be saying.
Note that the reward model is of the same scale as the base model, so the relative scale should be the same.
This also contradicts results from an earlier paper by Leo Gao. I think this paper is considerably more reliable than the comment overall, so I’m inclined to believe the paper or think that I’m misunderstanding the comment.
Additionally, from first principles I think that RLHF sample efficiency should just increase with scale (at least with well-tuned hyperparameters), and I think I’ve heard various things that confirm this.
Oops, fixed.
Bad: AI developers haven’t taken alignment seriously enough to have invested enough in scalable oversight, and/or those techniques are unworkable or too costly, causing them to be unavailable.
Turns out at least one scalable alignment team has been struggling for resources. From Jan Leike (formerly co-head of Superalignment at OpenAI):
Over the past few months my team has been sailing against the wind. Sometimes we were struggling for compute and it was getting harder and harder to get this crucial research done.
Even worse, apparently the whole Superalignment team has been disbanded.
I have access to Gemini 1.5 Pro. Willing to run experiments if you provide me with an exact experiment to run, plus cover what they charge me (I’m assuming it’s paid; I haven’t used it yet).