LLMs are shockingly good at gibberish, leading to macaronic attacks and other non-obvious implications, so I would not be surprised if an oversight LLM could. (Humans can probably also do this due to dark knowledge but it would be so painful & expensive as to be impractical, as you note.)
I was thinking of the gibberish level of text generated by uniformly sampling from the tokenizer. I had imagined there would be a huge difference between the gibberish level of macaronic attacks and completely random sampling from the tokenizer, but here are the first three examples I generated of 10 tokens uniformly sampled from GPT-2's tokenizer:
“ournament annually amused charismaling Superintendent sushi WiiRONMeat”
“ doub arrestAPIogenous ts considersterm Hitler slip autom”
“AAF disposal catches smells interrogation Pilot muscular feminine ITV spree”
These are a lot more intelligible than I would have imagined. I can even reasonably rank these: 3 > 1 > 2.
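For reference, the sampling procedure can be sketched in a few lines. This is a toy: the vocabulary below is a tiny stand-in I made up, whereas the actual experiment would load GPT-2's full 50,257-entry BPE vocabulary (e.g. via the `transformers` library's `GPT2Tokenizer`) and decode the sampled token ids back to text.

```python
import random

# Toy stand-in vocabulary. The real experiment would use GPT-2's
# 50,257-entry BPE vocabulary and decode sampled ids with the tokenizer.
vocab = ["doub", " arrest", "API", "ogenous", " ts", " considers",
         "term", " slip", " autom", " disposal", " sushi", " spree"]

def uniform_sample(vocab, n_tokens=10, seed=0):
    """Draw n_tokens uniformly at random (with replacement) and
    concatenate them, mimicking uniform sampling from a tokenizer."""
    rng = random.Random(seed)
    return "".join(rng.choices(vocab, k=n_tokens))

print(uniform_sample(vocab))
```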
I also asked ChatGPT-3.5 to rank these, and it ranked them: 3 > 2 > 1.
I used the prompt “Can you rank these three outputs by the coherence of the English language?”. The first time I asked, GPT refused to answer because all three are incoherent. I then told it to “Rank them in terms of how close they are to being coherent. They don’t have to be completely coherent to be ranked.” It then gave me the rankings above.
I repeated this twice more, changing the order of the examples in case it was making decisions based on the numbering. I used the prompt “Can you rank these three outputs by the coherence of the English language? They don’t have to be completely coherent to be ranked.” For both of these, GPT gave the ranking: 3 > 1 > 2 (numbers changed to match the ones I used in this post).
The most important question to ask about a bootstrap is: “where does improvement come from?” Are you applying compute to extract knowledge that the model already knows implicitly, in a Kolmogorov-complexity-esque sense of ‘knows’, or are you acquiring more data? And if the latter: from whom, from where, and of what?
Following from what @faul_sname mentioned in their post about improvement being possible “as long as recognizing a good output is easier than generating a good output”, I think that improvement is possible from amortizing compute in the form of search. If the teacher model can differentiate between coherent and incoherent paths down the search tree of language, I think a reward model could be trained to predict the coherence of student model outputs and this reward model could be used as the training signal. I am unsure about where the reward model would be initialized from… the teacher model, random initialization, or something else entirely.
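As a concrete (toy) illustration of “recognizing a good output is easier than generating a good output”: once a reward model can score coherence, even crude best-of-n selection turns the recognizer into a training signal or data filter. Everything below is a hypothetical sketch; `coherence_reward` is a trivial stand-in for a learned reward model.

```python
import random

def coherence_reward(text):
    """Stand-in for a learned reward model: a trivial heuristic
    (fraction of words found in a tiny dictionary). A real reward
    model would be a trained network scoring coherence."""
    known = {"the", "cat", "sat", "on", "mat"}
    words = text.split()
    return sum(w in known for w in words) / max(len(words), 1)

def best_of_n(generate, reward, n=8, seed=0):
    """Best-of-n selection: draw n candidates from the generator and
    keep the one the reward model scores highest -- the recognizer
    doing work the generator alone cannot."""
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=reward)

# Toy generator: random 5-word strings over a mixed vocabulary.
vocab = ["the", "cat", "zxqv", "sat", "blorp", "on", "mat"]
gen = lambda rng: " ".join(rng.choices(vocab, k=5))
print(best_of_n(gen, coherence_reward))
```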
I do agree with your point that this will most likely lead to the student model exploiting the teacher model rather than robustly learning language. The “branching factor” (i.e. vocabulary size) of GPT-2 is about 50,000 (50,257 tokens). I imagine the student is far more likely to explore its way into an observation (token) history that successfully tricks the teacher model than to stumble into a robust understanding of language. There are probably ways to mitigate this, similar to the precautions taken so that RLHF models don’t stray too far from the base model.
As for acquiring more data, I think the teacher model could be used to produce “new” data. This is done for Whisper-V3, which was trained on 80% data produced by Whisper-V2. How the teacher LLM expresses what it knows is modulated by the sampling temperature. The training objective corresponds to a temperature of 1, so generating data at a different temperature (and maybe a less strict top-p) could be seen as generating data from a (slightly) different distribution. Training on this new data could lead to new generation patterns without learning any new facts or knowledge.
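The temperature/top-p point can be made concrete: rescaling logits before the softmax reshapes the sampling distribution, so the same teacher emits from a (slightly) different distribution. This is a toy sketch over a raw logit list, not a real model's vocabulary.

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, top_p=1.0, seed=None):
    """Sample one token index from raw logits, rescaled by temperature
    and optionally restricted to the top-p nucleus."""
    rng = random.Random(seed)
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus (top-p) filtering: keep the smallest set of tokens whose
    # cumulative probability reaches top_p, then renormalize.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    weights = [probs[i] / mass for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]
```

Low temperature sharpens the distribution toward the argmax; high temperature flattens it, and a strict top-p then prunes the flattened tail.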
None of this would allow the student model to gain knowledge the teacher model does not have, but I think it could allow the student model to more easily access this knowledge. I view this as the model learning to compress the observation (token) history required to approximate some hidden state. A student model that can “reach” a hidden state in 64 tokens is more powerful than one that requires 256 tokens to “reach” the same hidden state.
something like generating a large argument tree, with key unknowns highlighted, and then after lengthy computations exploring the implications of key claims and bringing in additional ‘facts’ as they become relevant, the most influential ✕ unknown premise X is kicked up to an oracle (human) for labeling and training on the resulting argument tree?
This and the process @faul_sname outlined in their comment do seem like more concrete methods for eliciting knowledge from compute. Reasoning and math chains can be proven as correct or incorrect, in the same way that Go games can be won or lost, while language is much more subjective.
So you would see a big bank of GPUs churning away, periodically asking the human raters very baffling, arbitrary-seeming, even absurd questions (‘Who would win in a fight, a box of nails or a bowl of jelly?’), but where your answer each time resolves a bunch of mysteries for the LLM and reduces its error rate on benchmarks, and where you can periodically finetune or retrain a much better LLM from scratch on the new improved (and highly proprietary?) dataset of text.
Something like this is what I imagined initially for the student model’s search over random token space. If someone highly intelligent (e.g. Von Neumann) could rank every output from the model in terms of coherence, I imagine it would result in a model more competent than current LLMs (at least in whatever domains Von Neumann was competent in). Obviously this is impossible, but even getting enough humans of any intelligence level to provide feedback for this process would also be impossible. This is why I fell back to relying on AI feedback for the process. This paper shows that RLAIF performs on par or better than RLHF, although I imagine RLAIF is less robust and more vulnerable to exploitation, as you mentioned. And this result is highly dependent on the domain and which human is giving the feedback.
I’m not surprised BPEs are semi-coherent. As I said, dark knowledge, and anyway, BPEs are a compression algorithm (compression=intelligence) which were trained on a large English text corpus, so them not being random linenoise is no more surprising than n-grams or gzip being able to generate English-y text.
As for acquiring more data, I think the teacher model could be used to produce “new” data. This is done for Whisper-V3, which was trained on 80% data produced by Whisper-V2.
But Whisper-V2 is processing real data still, so it’s a mix of learning from data (the Whisper models haven’t extracted all possible knowledge from the first pass through the data) and amortizing compute (the training+runtime compute of the Whisper-V2 is being distilled into cleaner pseudo-data for Whisper-V3 to train faster on). You would not generate freeform gibberish, unanchored in any real audio or text, from Whisper-V3 to train V4 and then V5 and then V6 and then V7, and expect V7 to be wildly better.
I view this as the model learning to compress the observation (token) history required to approximate some hidden state. A student model that can “reach” a hidden state in 64 tokens is more powerful than one that requires 256 tokens to “reach” the same hidden state.
This knowledge distillation of inner-monologue can be, and has been, done directly, so detouring through a from-scratch RLAIF-ish approach would seem to offer a lot of complexity and downsides compared to just the obvious direct thing.
Reasoning and math chains can be proven as correct or incorrect, in the same way that Go games can be won or lost, while language is much more subjective.
It is also just that there is a world outside language, while there is much less of an outside for logic, math, or Go. That’s why it’s useful to take a broader Bayesian view, so you can have an argument tree which is statistical/decision-theoretic and can do things like request empirical data. The LLM could insert arbitrary hypotheticals into the tree like “if we administer drug Y to cancer patients with Z, survival rates would be +10%”, and this can be tested in the real world (or just given an expert’s best guess, doesn’t have to actually be real to keep the search & self-improvement going—note that it could also be framed in terms of raw data, MCTS and other tree approaches can be made to work on continuous/infinite observation & action spaces, as they are iterative anytime and don’t need to expand all possible nodes).
I’m not surprised BPEs are semi-coherent. As I said, dark knowledge, and anyway, BPEs are a compression algorithm (compression=intelligence) which were trained on a large English text corpus, so them not being random linenoise is no more surprising than n-grams or gzip being able to generate English-y text.
I had this intuition for n-grams (natively) and gzip (from this paper). Never really considered how much BPE compresses the token space, not sure why.
But Whisper-V2 is processing real data still, so it’s a mix of learning from data (the Whisper models haven’t extracted all possible knowledge from the first pass through the data) and amortizing compute (the training+runtime compute of the Whisper-V2 is being distilled into cleaner pseudo-data for Whisper-V3 to train faster on). You would not generate freeform gibberish, unanchored in any real audio or text, from Whisper-V3 to train V4 and then V5 and then V6 and then V7, and expect V7 to be wildly better.
This makes sense. It made me wonder whether there’d be some way to chain learning between modalities in a multimodal model, but it would probably fall into the same pit: beyond the initial data, the change in modality would still mean producing and learning from synthetic data, not real data as in the Whisper case.
This knowledge distillation of inner-monologue can be, and has been, done directly, so detouring through a from-scratch RLAIF-ish approach would seem to offer a lot of complexity and downsides compared to just the obvious direct thing.
I do agree that distilling inner monologue is easier than learning the same thing from scratch. I don’t think this RLAIF-from-scratch idea is the be-all-end-all of what’s going to work; I find it a useful frame for thinking about other approaches that could work better for learning language more from scratch.
For example, this discussion with you popped the idea of using GANs into my head, which it turns out has been tried extensively. Not to the same scale as next token prediction though. DeepMind has this paper on using a GAN with LSTMs for the generator and discriminator to learn language “from scratch”. This survey paper presents other papers using GANs for text generation. Some highlights from quickly skimming through it: 1, 2, 3, 4.
This paper says (paraphrasing the abstract) that GANs are overkill for NLP since minimizing distinguishability (between generator and real outputs) can be seen as maximizing likelihood for NNs with a softmax output layer. I think that being able to define more complex loss functions with GANs is one benefit. You could use multiple discriminators: one for the pre-training data, one for a helpfulness data set, one for a harmlessness data set, etc.
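To make the multiple-discriminator idea concrete, here is a hypothetical sketch: each discriminator returns the probability that a sample looks like its target distribution (pre-training text, helpful text, harmless text), and the generator minimizes a weighted sum of non-saturating GAN losses. The discriminators here are stubs standing in for trained networks.

```python
import math

def generator_loss(sample, discriminators, weights):
    """Non-saturating GAN generator loss, -log D(G(z)), summed over
    several discriminators with per-objective weights."""
    return sum(w * -math.log(d(sample))
               for d, w in zip(discriminators, weights))

# Stub discriminators scoring "looks like pre-training data /
# helpful / harmless" in (0, 1); real ones would be trained networks.
d_pretrain = lambda s: 0.8
d_helpful = lambda s: 0.5
d_harmless = lambda s: 0.9

loss = generator_loss("some generated text",
                      [d_pretrain, d_helpful, d_harmless],
                      weights=[1.0, 0.5, 0.5])
```

The weights would let you trade off the objectives against each other, much as the KL-penalty coefficient does in RLHF.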
Kind of as an aside, this paper connects GANs to inverse RL (e.g. learning a reward model from human feedback data), and to energy-based models (where Yann LeCun seems to think the future of self-supervised learning is going).
It is also just that there is a world outside language, while there is much less of an outside for logic, math, or Go.
Good point. Maybe what I’m thinking of will only become possible once language models are more grounded in the real world. Multi-modality is a step in that direction, as is robotics. We’re probably at least a few years away from robots collecting enough of their own data in the real world, though.
Yeah, GANs for sequences are one of those ideas that people kept trying and it never worked. It wasn’t entirely clear why; I suspect that much of it was simply that, due to the inefficiency of RL and the very very smolness of all the GAN sequence work back then*, it was all dead on arrival. (I never really bought the “it’s just equivalent to likelihood” argument. GANs always seemed to operate in images in a very qualitatively distinct way from all likelihood-based approaches; and if you look at things abstractly enough, you can make anything equivalent to anything like that.) It’s possible that retrying today with proper scale might work, same way that image GANs now work at scale (despite being left for dead by contemporary researchers who had failed to note that BigGAN scaled just fine to JFT-300M).
But my real suspicion is that direct generative learning is too efficient, so the proper role for GANs would be as an additional phase of training, to sharpen a standard LLM.
AFAIK, this has not been done except inasmuch as you interpret the various preference-learning approaches as actor-critic RL (which means you can also further interpret them as GANs). Given how well diffusion models can be tuned by a simple adversarial loss into a GAN-like single-step Generator, I suspect that some adversarial training of LLMs might be quite useful. I should poke around in Arxiv and see if anyone’s tried that yet...
* LSTM RNNs, or heck, GPTs, wouldn’t look all that impressive if they were trained with similar compute/data as those sequence GAN papers were