This is probably obvious, but maybe still worth mentioning:
It’s important to take into account the ROI per unit time. In the amount of time it would take for me to grok transformers (let’s say 100 hours), I could read ~1 million tokens, which is ~0.0002% of the training set of GPT-3 (see the back-of-the-envelope check below).
The curves aren’t clear to me, but I would bet grokking transformers would be more effective than a 0.0002% increase in training set knowledge.
This might change if you only want to predict GPT’s output in certain scenarios.
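A back-of-the-envelope check of the percentage above (a sketch only; the ~500-billion-token corpus size and the reading speed are assumptions based on the GPT-3 paper’s reported dataset scale, not figures from this thread):

```python
# Back-of-the-envelope check of the "~0.0002%" figure above.
# Assumptions (not from the comment): GPT-3's training corpus is roughly
# 500 billion tokens, and ~1 million tokens can be read in ~100 hours
# (i.e. ~10,000 tokens/hour).
corpus_tokens = 500e9      # assumed corpus size, in tokens
tokens_per_hour = 10_000   # assumed reading speed
hours = 100                # time budget from the comment

tokens_read = tokens_per_hour * hours
fraction = tokens_read / corpus_tokens
print(f"{tokens_read:,.0f} tokens read ~ {fraction:.4%} of the corpus")
# prints: 1,000,000 tokens read ~ 0.0002% of the corpus
```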
I think you would get diminishing returns, but reading a few hundred thousand tokens would teach you quite a lot, and I think likely more than knowing Transformers would. I’m not convinced that Transformers are all that important (architectures seem to converge at scale, you can remove attention entirely without much damage, not much of the FLOPS is self-attention at this point, etc.), but you learn a lot about why GPT-3 is the way it is if you pay attention to the data.

For example, BPEs/poetry/puns: you will struggle in vain to explain the patterns of GPT strengths & weaknesses in poetry & humor with reference to the Transformer arch rather than to how the data is tokenized (a tokenization invented long before Transformers; see the tokenizer sketch below). Or the strange sorts of arbitrary-seeming output where paragraphs get duplicated or bits of vague text get put into an entire paragraph on their own, which you quickly realize, reading Common Crawl, are due to lossy conversion of complex, often dynamic HTML into the WET ‘text’ files, leading to spurious duplications or images being erased; or the peculiar absence of Twitter from Common Crawl (because they block it). Or the multi-lingual capabilities—much more obvious once you’ve read through the X% of non-English text.

Many things, like why inner-monologue is not sampled by default, become obvious: most answers on the Internet don’t “show their work”, and when they do, they have prefixes which look quite different than how you are prompting. You will also quickly realize “there are more things in heaven and earth than are dreamt of in your philosophy”, or more contemporaneously, “the Net is vast and infinite”. (To take an example from IRC today: can GPT-4 have learned about 3D objects from VRML? I don’t see why not. There is a lot more VRML out there than you realize, because you forgot, or never knew, about the ’90s fad for VR in browsers—but the Internet hasn’t forgotten it all.)

Personally, when I think back to everything I have written about GPT-3 or the scaling hypothesis, I think that most of my knowledge of Transformers/self-attention could’ve been snipped out and replaced with mislabeled knowledge about RNNs or MLP-Mixers, with little damage.
(EDIT: I also agree with Ryan that the proprietary RLHF dataset would be quite educational. I expect that if you had access to it and could read the poetry samples, the ChatGPT mode collapse would instantly cease to be a mystery, among other things.)
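To make the BPE point concrete, here is a minimal sketch using the tiktoken library’s GPT-2 encoding (the library choice and the example words are assumptions for illustration, not something from the thread):

```python
# Minimal sketch of the BPE point above: the model is trained on sub-word
# token IDs, not characters or sounds, so rhyme and wordplay are only
# indirectly represented. Assumes the `tiktoken` package is installed.
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # the BPE vocabulary used by GPT-2/GPT-3

for text in ["rough", "through", "threw", " through"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r:>12} -> ids={ids} pieces={pieces}")

# Note how the same word with and without a leading space generally maps to
# different IDs, and phonetically similar words need not share any pieces --
# which is why poetry/pun behaviour tracks the tokenizer, not the Transformer.
```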
btw, if anyone wants to quickly read a sample of Common Crawl, you can do it here.
Does this mean that hugely superior architectures to transformers (measured by benchmarking them with the same compute and data input) don’t exist, or that transformers, RNNs, and everything else are all close-enough cousins?
The latter. I am quite certain that hugely superior architectures exist, in the sense of both superior exponents and superior constants (though I’m less sure about current architectures being hugely strictly dominated on both), and these are the sorts of things the whole hierarchy of meta-learning is about learning/locating; but the current architectures are all pretty much alike in being big blobs of feedforward architectures whose inductive biases wash out at what are, in absolute terms, quite small scales (so small that we can achieve them right now with small budgets like millions to billions of dollars), as long as they achieve the basic desiderata of passing signals/gradients through themselves without blowing up or flatlining. DL archs fail in many different ways, but the successes are all alike: ‘the AI Karenina principle’. Thus, the retrodiction that deep (>4) stacks of fully-connected layers just needed normalization to compete; my long-standing assertion that Transformers are not special fairy-dust, self-attention is not magical, and Transformers are basically better-optimized RNNs; and my (recently vindicated) prediction that, despite the entire field abandoning them for the past 3–4 years because they had been ‘proven unstable’, GANs would nevertheless work well once anyone bothered to scale them up.
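As a toy illustration of “superior exponents vs. superior constants”: under the usual power-law scaling shape, a better constant wins at small compute while a better exponent wins past a crossover. The numbers below are invented purely for illustration and describe no real architecture:

```python
# Toy comparison of two hypothetical architectures under power-law scaling,
# L(C) = a * C**(-b): A has a better constant, B a better exponent.
# All numbers are invented for illustration only.
def loss(compute, a, b):
    return a * compute ** (-b)

arch_A = dict(a=5.0, b=0.05)   # hypothetical: better constant, worse exponent
arch_B = dict(a=7.0, b=0.07)   # hypothetical: worse constant, better exponent

for compute in [1e6, 1e9, 1e12, 1e15]:
    la, lb = loss(compute, **arch_A), loss(compute, **arch_B)
    print(f"C={compute:.0e}  A={la:.3f}  B={lb:.3f}  winner={'A' if la < lb else 'B'}")

# At small compute the constant dominates (A wins); past the crossover the
# exponent dominates (B wins). A "hugely superior" architecture would improve
# both, while today's architectures mostly sit on similar curves.
```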
I don’t think researchers should learn world-facts in order to understand GPT-4.
I think that (1) researchers should use the world-facts they already know (but are actively suppressing due to learned vibe-obliviousness) to predict/explain/control GPT-4, and (2) researchers should consult a domain expert if they want to predict/explain/control GPT-4’s output on a particular prompt.
You might want to clarify that, because in the post you explicitly say things like “if your goal is to predict the logits layer, then you should probably learn about Shakespearean dramas, Early Modern English, and the politics of the Late Roman Republic.”
okay, I’ll clarify in the article — “if your goal is to predict the logits layer on this particular prompt, then you should probably learn about Shakespearean dramas, Early Modern English, and the politics of the Late Roman Republic.”