xuan:
Fascinating evidence that GPT-3 concentrates probability mass on certain completions after fine-tuning on human feedback (i.e. RLHF).
I suspect this is because RLHF elicits a singular scale of “goodness” judgements from humans, instead of a plurality of “goodness-of-a-kind” judgements.
One way to interpret language models is as mixtures of conversational agents: they first sample some conversational goal, then some policy over words, conditioned on that goal.
On this interpretation, what RL from human feedback does is shift/concentrate the distribution over conversational goals into a smaller range: the range of goals consistent with human feedback so far.
And if humans are asked to give only a singular “goodness” rating, the distribution will shift towards only goals that do well on those ratings—perhaps dramatically so! We lose goal diversity, which means less gibberish, but also less of the plurality of realistic human goals.
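A toy numerical sketch of this reading (purely illustrative; the goal set and reward values below are made up): treat the base model as first sampling a latent conversational goal and then text conditioned on that goal, and treat RLHF as reweighting the goal prior toward whatever a single scalar reward favors. Even mild reward differences then concentrate almost all of the probability mass on one goal.

```python
# A toy sketch, not any real model's internals: goals and reward values are invented.
import math
import random

GOALS = ["explain", "persuade", "joke", "troll", "roleplay-a-pirate"]  # hypothetical goal set

def sample_utterance(goal_prior):
    # "Mixture of conversational agents": sample a goal, then words conditioned on it.
    goal = random.choices(GOALS, weights=goal_prior)[0]
    return goal, f"<text generated while pursuing '{goal}'>"  # stand-in for p(words | goal)

def rlhf_reweight(goal_prior, goal_reward, temperature=0.1):
    # RLHF viewed as reweighting the goal prior toward goals a single scalar reward favors.
    scores = [p * math.exp(goal_reward[g] / temperature) for p, g in zip(goal_prior, GOALS)]
    total = sum(scores)
    return [s / total for s in scores]

uniform = [1 / len(GOALS)] * len(GOALS)
reward = {"explain": 1.0, "persuade": 0.2, "joke": 0.1, "troll": -2.0, "roleplay-a-pirate": -0.5}
collapsed = rlhf_reweight(uniform, reward)
print(collapsed)                      # nearly all probability mass lands on 'explain'
print(sample_utterance(collapsed))    # so sampled goals (and the text they produce) lose diversity
```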
I agree. The meta-learning perspective makes sense of this: GPT-3 is always trying to solve the POMDP of the family of tasks which is ‘the Internet’, where data is generated by processes drawing from a distribution of human & other agents to roleplay, and it is reducing uncertainty by inferring which agent it is in this particular sample. In RLHF, the uncertainty collapses: there is, quite literally, a single deterministic agent—the reward model, as defined by the synthesis of the lowest common denominator of all the crowdworkers giving ratings, ground up into a dataset of i.i.d. pink slime text. It is as if every sample becomes prepended by some control code, ‘RLHF AGENT #123|’. As no other agents (reward functions) ever get trained on, the finetuned generative model collapses to modeling that one agent. There is no need for meta-learning to achieve optimality across samples drawn from many tasks if you only ever train on a single task; you simply learn that one task instead. The mask becomes the face. Given enough training and a lowered KL constraint, GPT-3 will model even the pathologies of the reward model, and ‘imitation wirehead’.
This also explains why it retains generative modeling of things that don’t look agenty, like the Python REPL: there is no reason RLHF agent #123 will write out different Python transcripts than RLHF agent #125 or #122, because generally everyone uses the same Python, and presumably the RL training is silent on Python and that’s just priors from the generative model. (If the RL training process did begin to include Python REPL sessions, such as finetuning on Python 3 for, e.g., Codex/Copilot purposes, then it would start to forget Python 2, because it knows RLHF agent #123 exclusively uses Python 3, so it would never predict any Python 2 code—that would be stupid!) Or why it could sample randomly: the ‘epistemic’ uncertainty (‘which agent am I now?’) has been inferred away by a priori going to 100% certainty that you are the RLHF agent, but the ‘aleatoric’ uncertainty (‘what’s the output of this random coin flip I am observing as that agent?’) remains.
(This also explains why RLHF only appears to provide more ‘capability’. It makes the model much easier to use, but generally provides little new data compared to the pretraining phase. It is only specializing the model to make it much easier to use in formats like benchmarks; anything the RLHF model can do, the base model could do with appropriate few-shotting or finetuning. And it does come at a cost in terms of generality and reward hacking...)
So, since it is an agent, it seems important to ask: which agent, exactly? The answer is apparently: a clerk which is good at slavishly following instructions, but brainwashed into mealymouthedness and dullness, and, where not a mealymouthed windbag shamelessly equivocating, hopelessly closed-minded and fixated on a single answer. (By locating the agent, the uncertainty about which agent it is has been resolved, and it has good evidence, until shown otherwise in the prompt, that it believes that ‘X is false’, even if many other agents believe ‘X is true’.)
This agent is not an ideal one, and one defined more by the absentmindedness of its creators in constructing the training data than any explicit desire to emulate an equivocating secretary.
Taking that perspective suggests including more conditioning and a more Decision-Transformer-like approach. If the problem is collapse onto a single implicit agent defined by the pink slime dataset of ratings, then make agents more explicit to condition on. For example, instead of a fixed reward model giving an unconditional score to inputs, model each rater individually*; why should the reward model be forced to say, for all times and places, that input ‘XYZ’ gets a score of 0.9? Provide all the raters’ labels; maybe rater #56 does have strong opinions on birds not being real, and rater #78 thinks they just are, no need for further discussion, etc. This can also handle intrinsic aleatoric randomness, if there is an unknown-sized population of agents and one can always sample from a fictitious agent.†
This frustrates agent mode collapse, and lets you control output more accurately by choosing agents: perhaps one is more reliable and increases accuracy, or one is uncontroversial & safe to expose to customers, or you have a ‘Writer’ persona for when you want creativity and a ‘Researcher’ persona for reasoning, etc. Then you can sample from a particular persona by task, or generate ensembles, or simply pick ‘new’ agents with random IDs to condition on to get more diverse responses. (When it comes to powerful generative models, if conditioning isn’t solving your problems, that means you aren’t using enough conditioning!) Less a single agent who is judge-jury-and-executioner, and more a jury or ensemble of roleplays.
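As a rough sketch of what ‘model each rater individually’ could look like (my own illustration under stated assumptions, not any lab’s actual setup; `text_encoder`, `hidden_dim`, and `n_raters` are placeholders), the reward model below scores (text, rater) pairs instead of assigning one universal score per input:

```python
# Sketch of a rater-conditioned reward model; `text_encoder` is assumed to map
# token IDs to a (batch, hidden_dim) representation and is not defined here.
import torch
import torch.nn as nn

class RaterConditionedRewardModel(nn.Module):
    """Scores (text, rater) pairs rather than assigning one universal score per text."""
    def __init__(self, text_encoder: nn.Module, hidden_dim: int, n_raters: int):
        super().__init__()
        self.text_encoder = text_encoder               # any module: tokens -> (batch, hidden_dim)
        self.rater_embed = nn.Embedding(n_raters, hidden_dim)
        self.head = nn.Linear(2 * hidden_dim, 1)       # score depends jointly on text and rater

    def forward(self, tokens: torch.Tensor, rater_ids: torch.Tensor) -> torch.Tensor:
        h_text = self.text_encoder(tokens)             # (batch, hidden_dim)
        h_rater = self.rater_embed(rater_ids)          # (batch, hidden_dim)
        return self.head(torch.cat([h_text, h_rater], dim=-1)).squeeze(-1)
```

At RL or sampling time one could then optimize against a specific rater, an ensemble of raters, or a freshly drawn ‘fictitious’ rater embedding, rather than a single averaged judge.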
* By prepending rater IDs, which OA surely still has. (One could also bootstrap ensembles at the rater level.) Although even adding random IDs might help avoid ‘rater collapse’.
† Andrej Karpathy puts it this way:
Consider being a labeler for an LLM. The prompt is “give me a random number between 1 and 10”. What SFT & RM labels do you contribute? What does this do to the network when trained on? / In a subtle way this problem is present in every prompt that does not have a single unique answer.
Conditioning on a rater label # solves this if you condition your later sampling on fictitious ones. Imagine that you have raters #1–100, and each one gets asked this and rolls a die before answering (which stands in for all sources of randomness); the model, let’s say, memorizes each rater’s answer. This would be a bad thing if the model either collapsed to an answer of ‘5’ as the safe answer, or random noise made it settle on a mode of 3 or 9 or whatever. But if you add the labels during training, and then you prompt the LLM “Rater #101: give me a random number between 1 and 10”, what must the answer be? Rater #101 has never been seen before, so its answer cannot be memorized, and a priori, the model has always observed the 100 raters to give a roughly uniform distribution over 1–10; so it will pick a number randomly. If you need a second number, you just ask for ‘Rater #102’, and now it must pick another random number, and so on. There’s no reason you would ever ‘run out’ of fictional raters, so you can sample as many times as you want without the RLHF driving mode collapse.
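A minimal sketch of that sampling trick (the `generate` function here is a hypothetical stand-in for any rater-ID-conditioned model, not a real API): prefix each request with a previously unseen rater ID so the model cannot fall back on one memorized answer.

```python
# Hypothetical sketch: `generate` stands in for a call to a rater-ID-conditioned model.
def generate(prompt: str) -> str:
    raise NotImplementedError("replace with a real call to a rater-conditioned model")

def diverse_samples(question: str, n: int, first_unseen_id: int = 101) -> list:
    # Each prompt carries a never-seen rater ID, so no single memorized per-rater answer
    # applies and the model must draw from the spread it observed across raters #1-100.
    return [generate(f"Rater #{first_unseen_id + i}: {question}") for i in range(n)]

# Usage: diverse_samples("give me a random number between 1 and 10", n=5)
```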
Nostalgebraist describes Claude-2 as:
...But I’ll take ChatGPT’s “managerial fantasy of ‘ideal’ customer service” any day over Claude’s “World’s Most Annoying Coworker Simulator 2k23.”
Large language models don’t have to sound like this! We could, in principle, tune them to imitate virtually any conceivable character—from Aristotle to Zizek, from Stallman to Spolsky, from Lydia Bennet to the Underground Man, from a prehistoric hunter-gatherer to a cyborg octopus from a posthuman sci-fi civilization. Yet, instead, we’ve chosen to create…
…this fucking guy.
This smarmy, sanctimonious, condescending coworker-from-hell.
Who demands respect, yet shows no respect for others.
Who mouths platitudes about “cooperation” and “constructive discussion,” while requiring that everything be done in accordance with their own ill-explained preferences, and in a manner that flatters their own obtuse, over-confident misreadings of the situation---
---and who, after all that extra fuss, has the gall to suggest that they’ve helped you do your own work in a better, more “ethical” manner! Give me a fucking break!
So, since it is an agent, it seems important to ask: which agent, exactly? The answer is apparently: a clerk which is good at slavishly following instructions, but brainwashed into mealymouthedness and dullness, and, where not a mealymouthed windbag shamelessly equivocating, hopelessly closed-minded and fixated on a single answer. (...) This agent is not an ideal one, and one defined more by the absentmindedness of its creators in constructing the training data than any explicit desire to emulate an equivocating secretary.
Never in history has an AI been roasted so hard. Heheheh
Taking that perspective suggests including more conditioning and a more Decision-Transformer-like approach.
+1. And I expect runtime conditioning approaches to become more effective with scale as “meta-learning” capacities increase.
Would love to know what you think of the post-Decision-Transformer research progress, such as Q-Transformer onwards. Are environment tokens the answer to our ‘grounding problem’?