I suspect it is a combination of #3 and #5.
Regarding #5 first, I personally think that language models are being trained wrong. We'll get OoM improvements when we stop randomizing the examples we show to models during training and instead present examples in a structured curriculum. This isn't a new thought; see e.g. https://arxiv.org/abs/2101.10382.
To be clear, I’m not saying that we must present easy examples first and then harder examples later. While that is what has been studied in the literature, I think we’d actually get better behavior by trying to order examples on a spectrum of “generalizes well” to “very specific, does not generalize” and then training in that order. Sometimes this might be equivalent to “easy examples first”, but that isn’t necessarily true.
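As a concrete (and heavily simplified) sketch of the difference I have in mind, with `generality_score` standing in for whatever metric would operationalize "generalizes well", which is exactly the part I don't know how to define:

```python
import random

def training_order(examples, generality_score):
    # Status quo: i.i.d. shuffling over the whole corpus.
    shuffled = random.sample(examples, k=len(examples))

    # Proposal: order examples from "generalizes well" down to
    # "very specific, does not generalize", then train in that order.
    # generality_score is a stand-in for a metric we don't have yet.
    curriculum = sorted(examples, key=generality_score, reverse=True)

    return shuffled, curriculum
```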
I recognize that the definitions of “easy” and “generalizes” are nebulous, so I’m going to try and explain the reasoning that led me here.
Consider the architecture of transformers and feed-forward neural networks (specifically not recurrent neural networks). We're given some input, and we produce some output. In a model like GPT, we're auto-regressive, so as we produce our outputs, those outputs become part of the input during the next step. Each step is fundamentally a function F(S1) -> S2.
Given some input, the total output can be thought of as:
```python
def reply_to(input):
    # Auto-regressive generation: each predicted token is appended
    # to the context before predicting the next one.
    output = ""
    while True:
        token = predict_next(input + output)
        if token == STOP:
            break
        output += token
    return output
```
We’d like to know exactly what `predict_next` is doing, but unfortunately, the programmer who wrote it seems to have done their implementation entirely in matrix math and they didn’t include any comments. In other words, it’s deeply cursed and not terribly different from the output of Simulink’s code generator.
```python
def predict_next(input):
    # ... matrix math ...
    return output
```
Let’s try to think about the capabilities and constraints on this function.
There is no unbounded `loop` construct. The best we can do is approximate loops, e.g. by supporting an unrolled loop up to some bounded number of iterations. What determines the bounds? Probably the depth of the network?
If the programmer were sufficiently deranged, they could implement `predict_next` in such a way that if they’ve hit the bottom of their unrolled loop, they could rely on the fact that `predict_next` will be called again, and continue their previous calculations during the next call. What would be the limitations on this? Probably the size of each hidden layer. If you wanted to figure out if this is happening, you’d want to look for prompts where the network can answer the prompt correctly if it is allowed to generate text before the answer (e.g. step-by-step explanations) but is unable to do so if asked to provide the answer without any associated explanations.
How many total "instructions" can fit into this function? The size of the network seems like a decent guess. Unfortunately, the network conflates instructions and data, and it must use all of the parameters available to it. This leads to trivial solutions where the network simply over-fits to the data (analogous to baking a lookup table into the stack). It's not surprising that throwing OoM more data at a fixed-size NN results in better generalization: once you can no longer cheat by over-fitting, you must learn algorithms that work more efficiently.
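To make the "unrolled loop" framing concrete, here's a toy sketch of my own, not anything resembling a real transformer implementation: a fixed-depth network gets a fixed number of refinement steps per call, and the only escape hatch is to emit intermediate tokens and get called again with them in the context.

```python
import numpy as np

DEPTH = 12   # analogous to layer count: the bound on the unrolled loop
WIDTH = 768  # analogous to hidden-state size: the available "stack space"

rng = np.random.default_rng(0)
layers = [rng.standard_normal((WIDTH, WIDTH)) / np.sqrt(WIDTH) for _ in range(DEPTH)]

def forward(state):
    # One call to the network: exactly DEPTH update steps, with no
    # data-dependent stopping condition and no way to run "one more" step.
    for w in layers:
        state = np.tanh(w @ state)
    return state

def forward_until_done(state, done):
    # What the architecture cannot express within a single call:
    # keep iterating until some condition holds.
    while not done(state):
        state = np.tanh(layers[0] @ state)
    return state
```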
The reason why I’m discussing the network in the language of instructions, stack space, and loops is because I disagree with a blanket statement like “scale is all you need”. I think it’s obvious that scaling the neural network is a patch on the first two constraints, and scaling the training data is a patch on the third constraint.
This is also why I think that point #3 is relevant. If GPT-3 does so well because it’s using the sea of parameters for unrolled loops, then something like Stable Diffusion at 1/200th the size probably makes sense.
To tie this back to point #5:
1. We start with a giant corpus of data, on the order of "all written content available in digital form". We might generate additional data in an automated fashion, digitize books, or caption videos.
2. We divide it into training data and test data.
3. We train the network on random examples from the training data, and then verify on random examples from the test data. For simplicity, I'm glossing over various training techniques like masking data or connections between nodes.
4. Then we fine-tune it, e.g. with Q&A examples.
5. And then generally we deploy it with some prompt engineering, e.g. prefixing queries with past transcript history to fake a conversation.
At the end of this process, what do we have?
I want to emphasize that I do not think it is a “stochastic parrot”. I think it is very obvious that the final system has internalized actual algorithms (or at least, pseudo-algorithms due to the limitation on loops) for various tasks, given the fact that the size of the data set is significantly larger than the size of the model. I think people who are surprised by the capabilities of these systems continue to assume it is “just” modeling likelihoods, when there was no actual requirement on that.
I also suspect we've wasted an enormous quantity of our parameters on embedding knowledge that does not directly contribute to the system's capabilities.
My hypothesis for how to fix this is vaguely similar to the idea of "maximizing divergence" discussed here: https://ljvmiranda921.github.io/notebook/2022/08/02/splits/.
I think we could train a LLM on a minimal corpus to “teach” a language[1] and then place that LLM inside of a larger system that we train to minimize loss on examples teaching logic, mathematics, and other components of reasoning. That larger system would distinguish between the weights for the algorithms it learns and the weights representing embedded knowledge. It would also have the capability to loop during the generation of an output. For comparison, think of the experiments being done with hooking up GPT-4 to a vector database, but now do that inside of the architecture instead of as a hack on top of the text prompts.
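Very roughly, and with entirely made-up component names, the shape I'm imagining is something like this sketch. The interfaces are hypothetical; the point is only that the loop and the knowledge lookup live inside the system rather than being faked at the prompt level.

```python
class ReasoningSystem:
    # Hypothetical sketch, not a real implementation: the names and
    # interfaces here are all made up to show the shape of the system.

    def __init__(self, minimal_llm, reasoner, knowledge_store, max_steps=64):
        self.llm = minimal_llm            # small LM trained only to handle language
        self.reasoner = reasoner          # trained on logic / math / reasoning examples
        self.knowledge = knowledge_store  # embedded facts, kept separate from algorithms
        self.max_steps = max_steps

    def answer(self, prompt):
        state = self.llm.encode(prompt)
        # Looping lives inside the architecture, rather than being faked
        # by re-prompting an auto-regressive model from the outside.
        for _ in range(self.max_steps):
            facts = self.knowledge.lookup(state)      # the "vector DB", but internal
            state = self.reasoner.step(state, facts)
            if self.reasoner.is_done(state):
                break
        return self.llm.decode(state)
```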
I think an architecture that cleanly separates embedded knowledge (“facts”, “beliefs”, “shards”, etc) from the algorithms (“capabilities”, “zero-shot learning”) is core to designing a neural network that remains interpretable and alignable at scale.
If you read the previous paragraphs and think, "that sounds familiar", it's probably because I'm describing how we teach humans: first language, then reasoning, then specialization. A curriculum.
We need language first because we want to be able to show examples, explain, and correct mistakes, and because we can automate content generation with existing LLMs to create the training corpus for these steps. Then we teach reasoning, starting with the most general forms and working toward the most specific. Finally, we grade the system (not train it!) on a corpus of specific knowledge-based activities. Think of this step as describing the rules of a made-up game, providing the current game state, and then asking for the optimal move. Except we'd do that for games, for poems, for math, for woodworking, for engineering, and so on. The whole point of general intelligence is that you can reason from first principles, so that's what we need to be grading the network on: minimizing loss with respect to arbitrarily many knowledge-based tasks that must be solved using only the facts provided during the test itself.
[1] Is English the right language to teach? I think it would be funny if a constructed language actually found a use here.
That’s a fair criticism, but why would it apply to only language models? We also train visual models with a randomized curriculum, and we seem to get much better results. Why would randomization hurt training efficiency for language generation but not image generation?
First, when we say “language model” and then we talk about the capabilities of that model for “standard question answering and factual recall tasks”, I worry that we’ve accidentally moved the goal posts on what a “language model” is.
Originally, language models were stochastic parrots. They were developed to answer questions like "given these words, what comes next?", "given this sentence, with this unreadable word, what is the most likely candidate?", or "what are the most common words?"[1] It was not a problem that required deep learning.
Then, we applied deep learning to it, because the path of history so far has been to take straightforward algorithms, replace them with a neural network, and see what happens. From that, we got … stochastic parrots! Randomizing the data makes perfect sense for that.
Then, we scaled it. And we scaled it more. And we scaled it more.
And now we’ve arrived at a thing we keep calling a “language model” due to history, but it isn’t a stochastic parrot anymore.
Second, I'm not saying "don't randomize data", I'm saying "use a tiered approach to training". We would use all of the same techniques: randomization, masking, adversarial splits, etc. What we would not do is throw all of our data and all of our parameters into a single, monolithic model and expect that to be efficient.[2] Instead, we'd first train a "minimal" LLM, then we'd use that LLM as a component within a larger NN, and we'd train that combined system (LLM + NN) on all of the test cases we care about for abstract reasoning / problem solving / planning / etc. It's that combined system that I think would end up being vastly more efficient than current language models, because I suspect the majority of language model parameters are being used to embed trivia that doesn't contribute to the core capabilities we recognize as "general intelligence".
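As a sketch of that tiered schedule, reusing the ReasoningSystem shape sketched earlier and hand-waving the losses with hypothetical helpers:

```python
def train_tiered(language_corpus, reasoning_tasks):
    # Tier 1: a "minimal" LLM whose only job is language itself, trained
    # with the usual tricks (randomization, masking, adversarial splits).
    llm = train_minimal_llm(language_corpus)   # hypothetical helper
    llm.freeze()                               # its weights stay fixed from here on

    # Tier 2: embed that LLM in a larger network and train the combined
    # system on reasoning / problem solving / planning examples, updating
    # only the outer network's parameters.
    system = ReasoningSystem(llm, reasoner=new_reasoner(), knowledge_store=new_store())
    for task in reasoning_tasks:
        loss = evaluate(system, task)          # hypothetical loss over the task
        update_outer_weights(system, loss)     # hypothetical optimizer step
    return system
```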
[1] This wasn't for auto-complete; it was generally for things like automatic text transcription from images, audio, or videos. Spam detection was another use-case.
[2] Recall that I'm trying to offer a hypothesis for why a system like GPT-3.5 takes so much training, has so many parameters, and still isn't "competent" in all of the ways that a human is competent. I think "it is being trained in an inefficient way" is a reasonable answer to that question.
Okay, that’s all fair, but it still doesn’t answer my question. We don’t do any of these things for diffusion models that output images, and yet these diffusion models manage to be much smaller than models that output words, while maintaining an even higher level of output quality. What is it about words that makes the task different?
Or are you suggesting that image generators could also be greatly improved by training minimal models, and then embedding those models within larger networks?
> We don't do any of these things for diffusion models that output images, and yet these diffusion models manage to be much smaller than models that output words, while maintaining an even higher level of output quality. What is it about words that makes the task different?
I’m not sure that “even higher level of output quality” is actually true, but I recognize that it can be difficult to judge when an image generation model has succeeded. In particular, I think current image models are fairly bad at specifics in much the same way as early language models.
But I think the real problem is that we seem to still be stuck on “words”. When I ask GPT-4 a logic question, and it produces a grammatically correct sentence that answers the logic puzzle correctly, only part of that is related to “words”—the other part is a nebulous blob of reasoning.
I went all the way back to GPT-1 (117 million parameters) and tested next-word prediction. Specifically, I gave it a bunch of prompts and looked only at whether the very next word was what I would have expected. I think it's incredibly good at that! Probably better than most humans.
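If you want to poke at this yourself, something along these lines should work, assuming the Hugging Face transformers package and its "openai-gpt" checkpoint (the original ~117M-parameter GPT):

```python
from transformers import pipeline

# "openai-gpt" is the original GPT checkpoint on the Hugging Face Hub.
generator = pipeline("text-generation", model="openai-gpt")

prompts = [
    "The capital of France is",
    "She poured the coffee into her",
]
for prompt in prompts:
    # Generate a single extra token and eyeball it against your own guess.
    print(prompt, "->", generator(prompt, max_new_tokens=1)[0]["generated_text"])
```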
> Or are you suggesting that image generators could also be greatly improved by training minimal models, and then embedding those models within larger networks?
No, because this is already how image generators work. That's what I said in my first post when I noted the architectural differences between image generators and language models. An image generator, as a system, consists of multiple models: a text → image-space encoder, and then an image-space → image decoder. The encoder is generally trained first, then it's normally frozen during the training of the image decoder.[1] Meanwhile, the image decoder is trained on a straightforward task: "given this image, predict the noise that was added". In the actual system, that decoder is put into a loop to generate the final result. I'm requoting the relevant section of my first post below:
> The reason why I'm discussing the network in the language of instructions, stack space, and loops is because I disagree with a blanket statement like "scale is all you need". I think it's obvious that scaling the neural network is a patch on the first two constraints, and scaling the training data is a patch on the third constraint.
> This is also why I think that point #3 is relevant. If GPT-3 does so well because it's using the sea of parameters for unrolled loops, then something like Stable Diffusion at 1/200th the size probably makes sense.
This is the idea that I'm saying could be applied to language models, or rather, to a thing that we want to demonstrate "general intelligence" in the form of reasoning / problem solving / Q&A / planning / etc. First train an LLM, then train a larger system with the LLM as a component within it.
[1] Refer to figure 2 in https://cdn.openai.com/papers/dall-e-2.pdf. Or read this:
> The trick here is that they decoupled the encoding from training the diffusion model. That way, the autoencoder can be trained to get the best image representation and then downstream several diffusion models can be trained on the so-called latent representation
Might feel validated by this: https://arxiv.org/abs/2305.07759
Are people doing anything in LLMs like the classic StyleGAN training-data bootstrapping pattern? Start with bad data, train a bad model. It's bad, but it's still good enough to rank your training data. Now you have better training data. Train a better model. The architecture is different, of course, but is there anything analogous?
Yes, it’s my understanding that OpenAI did this for GPT-4. It’s discussed in the system card PDF. They used early versions of GPT-4 to generate synthetic test data and also as an evaluator of GPT-4 responses.
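For reference, the general shape of that bootstrapping pattern in rough pseudocode, with hypothetical helpers:

```python
def bootstrap(raw_examples, rounds=2):
    # Round 0: train on whatever you have, noise and all.
    model = train(raw_examples)   # hypothetical trainer
    data = raw_examples
    for _ in range(rounds):
        # Use the current, imperfect model to rank and filter the data...
        data = [ex for ex in data if model.score(ex) > 0.5]  # arbitrary quality cutoff
        # ...then train the next, hopefully better, model on the cleaner set.
        model = train(data)
    return model
```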