So, it seems that scaling up isn’t the only thing that matters, and data quality can be more important than data quantity or parameter count. (You hear that, gwern?)
Apparently someone didn’t actually read my scaling hypothesis essay (specifically, the parts about why pretraining works and the varieties of blessings of scale). I have been pointing out for a long time that NNs are overparameterized and almost all training data is useless (which is a big part of why RL will be important, because RL lets you make the right data, or see meta-learning or dataset distillation or active-learning or core-sets...), and the reason scaling works is that all the prior approaches throw the baby out with the bathwater—the bitter lesson of Internet scrapes working so well is that we were too stupid to hand-engineer selecting only the right data, so if you admit defeat and train on all the data, at least you can’t screw that up. The optimal amount of data, whether natural or synthetic, you need to train an AGI will be many orders of magnitude smaller than the amount the first training run will actually use; this is one of the most important overhangs.
We are getting smarter about choosing the right data, but it’s far from obvious that we’re smart enough yet...
I would mention “The False Promise of Imitating Proprietary LLMs” as an example of this: people thought they could take a quick cheap shortcut and clone GPT-3.5/4 by behavior-cloning its outputs, but they were operating under false premises—RLHF doesn’t magically teach a model lots of things it didn’t know before, all RLHF does is collapse the POMDP by hardwiring a few specific latent values ie. specialize a model down to things it already knows & provide a more user-friendly interface to what a model could already do. So, you get a sugar-rush from imitation-learning on RLHFed models which optimizes it for the benchmarks, but you don’t get the general broad capabilities you actually wanted, and that’s why—for all the goldrush hype declaring small open models king and how OA has no moat etc—you don’t see people using them all that much or in interesting ways outside of the benchmark tasks which no one cares about.
(This is the same mistake people made with GPT-3: “what benchmarks miss” is the broad flexible generalization and universal knowledge, and so if you say “GPT-3 isn’t important because smaller specialized finetuned models match its zero-shot”, or if you say “LLaMA is important because it’s a small specialized finetuned model which matches GPT-3.5’s zero-shot”, those benchmark numbers may be true, but they’re still irrelevant to what users really want.)
So, we’ll see. But at least training on textbooks is more plausible in terms of eliciting ‘what benchmarks miss’ than training on kindergarten-level fiction stories!
Your “scaling hypothesis essay”? I was thinking of stuff like this thread.
https://gwern.net/scaling-hypothesis
I know how to use google. The linked tweet is gwern liking MLP-mixer even more than transformers, because it’s even simpler and all you need is a further-simplified architecture plus even more scale. But a couple years later, ViTs seem to be a lot better than that, and hybrid convolutional-transformer systems are more efficient than those.
I don’t know how that’s relevant. Liking MLP-Mixers doesn’t show that I think that datasets right now are optimal-sized and cannot be made much smaller, nor does it show that I didn’t argue the latter when this was a big part of my Tool AI essay and my explanation for why GPT-3 pretraining could work.
But, since you want to bring it up: I stand by that tweet. What I said then remains true today, as far as I know:
(And no one’s even tried redoing any of this with the latest hotness, simple efficient dense MLPs with some simple hierarchical structure a la MLP-Mixer.)
Arguments from silence are only compelling if there ought to be a lot of noise.
Nor am I particularly worried that it’s been all of 2 years and we haven’t thrown out all the Transformers in favor of some more MLP-esque architecture:
architecture changes, as obvious and simple as they may seem in hindsight, can take an awful long time.
For example, the architectural tweaks that made deep fully-connected archs work and brought stuff like MLP-Mixer back to the mainstream, despite being trivial on the level of ‘divide by a constant’, nevertheless took something like 7 years to be invented after the early studies showing ‘fully-connected layers don’t scale’. This is pretty quick compared to many things—residual layers have been around since ~1988 before their 2014 reinvention, and most of the Bitter Lesson examples took decades. So, I’ll start worrying in about, oh say, a decade. (A better counterargument here would be, ‘perhaps they’ll win in the long run, but in the long run, we’re all dead’.)
there is no strong evidence against MLP-style approaches thus far; there have been no airtight theoretical proofs nor large-scale empirical benchmarkings showing them flatlining.
The available scaling laws, in fact, look pretty similar, like in Tay et al 2022. Considering how vastly less effort has gone into MLP-Mixers, to the point where Tay et al 2022 has to benchmark it only on encoders “since Mixers have not been used in autoregressive decoding”, and how we know that bad hyperparameters & simple arch flaws can destroy scaling, I consider it quite promising that its scaling curve already looks reasonable and beats some of the others.
I would also note that this is what you would expect of a more general architecture: starting off with a worse constant, but similar or better scaling, which means people neglect it in favor of what currently scores the highest on benchmarks (like ViT). Nothing new there! That’s how Transformers for images worked back when I started heretically suggesting CNNs might be eventually beaten by Transformer-CNN hybrids and someday soon, even all-Transformer image models, as the CNN inductive bias will wash out and we eventually start using multi-billion-parameter & -image models. So if you are going to cite ViT, I would remind you that the idea of a ViT could’ve been criticized for exactly the same reasons c. 2019: “it’s been 2 years after Vaswani, but CNNs, and hybrid-convolutional-Transformer systems, seem to be a lot better and more efficient than Transformers”...
there is extensive lockin and lack of experimentation with variants at scale.
Look at how everyone is still using BPEs, despite the disaster they have been for text2image models and their increasingly bizarre consequences like GPT-3 models that cannot genuinely rhyme but also will refuse to write non-rhyming poems. If OA can’t even move away from BPEs, then I’m not holding my breath for bigger novelties. (Stuff doesn’t just happen on its own, you know. Someone has to do it.)
as Transformers scale, they become ever more just ‘MLP archs with some self-attention’, and yet, they continue to work well and scale as usual. This should trouble people who believe self-attention is magic!
Roller points out that in OPT, the FLOPs spent in a model go from being 40% MLP at GPT-2 to 80% MLP at GPT-3 scale (a rough FLOP accounting is sketched below). Presumably if you kept scaling up Transformers, the self-attention would just keep getting smaller… which raises the question of, in the long run, do you really need any explicit self-attention, or is whatever self-attention does mostly useful at small model sizes, and can be done by a more uniform large architecture?
I’d also point out the scaling studies of Nie et al 2021, where self-attention outperforms MLPs… but not by much, decreasing by scale, and even a tiny amount of self-attention essentially closes the gap. (And Liu et al 2021.)
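To make that FLOP point concrete, here is a rough back-of-the-envelope sketch using a standard dense-Transformer accounting (4× MLP expansion, 2 FLOPs per multiply-accumulate, ignoring layernorms, softmax, and embeddings); the exact percentages depend on counting conventions, so treat the output as illustrative rather than as Roller’s exact figures:

```python
# Rough per-token, per-layer FLOP accounting for a dense Transformer block.
# Illustrative assumptions: 4x MLP expansion, 2 FLOPs per multiply-accumulate,
# attention = QKV/output projections plus the context-length-dependent
# score/value step; layernorm, softmax, and embeddings are ignored.

def flop_shares(d_model: int, n_ctx: int) -> dict:
    attn_proj   = 8 * d_model ** 2       # Q, K, V, and output projections
    attn_scores = 4 * n_ctx * d_model    # QK^T scores plus attention-weighted sum
    mlp         = 16 * d_model ** 2      # two d_model x 4*d_model matrices
    total = attn_proj + attn_scores + mlp
    return {"mlp": mlp / total, "attention": (attn_proj + attn_scores) / total}

# Width grows far faster than context from GPT-2-scale to GPT-3-scale models,
# so the MLP share of block FLOPs climbs toward its asymptote of 2/3.
for name, d, n in [("GPT-2-ish", 768, 1024), ("GPT-3-ish", 12288, 2048)]:
    s = flop_shares(d, n)
    print(f"{name}: MLP ~{s['mlp']:.0%} of block FLOPs, attention ~{s['attention']:.0%}")
```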
as Transformers scale, they become ever more just ‘MLP archs with some self-attention’, and yet, they continue to work well and scale as usual. This should trouble people who believe self-attention is magic!...I’d also point out the scaling studies of Nie et al 2021, where self-attention outperforms MLPs… but not by much, decreasing by scale, and even a tiny amount of self-attention essentially closes the gap. (And Liu et al 2021.)
In the other direction, MLP layers can be trained to imitate realistic self-attention layers with high accuracy: “Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers”, Bozic et al 2023:
This work presents an analysis of the effectiveness of using standard shallow feed-forward networks to mimic the behavior of the attention mechanism in the original Transformer model, a state-of-the-art architecture for sequence-to-sequence tasks.
We substitute key elements of the attention mechanism in the Transformer with simple feed-forward networks, trained using the original components via knowledge distillation.
Our experiments, conducted on the IWSLT2017 dataset, reveal the capacity of these “attentionless Transformers” to rival the performance of the original architecture.
Through rigorous ablation studies, and experimenting with various replacement network types and sizes, we offer insights that support the viability of our approach. This not only sheds light on the adaptability of shallow feed-forward networks in emulating attention mechanisms but also underscores their potential to streamline complex architectures for sequence-to-sequence tasks.
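For readers who want the mechanics, here is a minimal sketch of the kind of distillation the paper describes: a small feed-forward ‘student’ trained to reproduce the outputs of a frozen self-attention ‘teacher’ on the same inputs. This is an illustrative reconstruction under assumed sizes and an MSE objective, not the authors’ code:

```python
import torch
import torch.nn as nn

# Toy distillation of a frozen self-attention layer (teacher) into a shallow
# feed-forward network (student) that sees the whole flattened sequence.
# Sizes are placeholders; the paper caps sentence length (they use 50 tokens).
d_model, seq_len, batch = 128, 50, 32

teacher = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)
teacher.eval()  # teacher stays frozen

student = nn.Sequential(                 # shallow FF replacement for attention
    nn.Linear(seq_len * d_model, 2048),
    nn.ReLU(),
    nn.Linear(2048, seq_len * d_model),
)

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(1000):
    x = torch.randn(batch, seq_len, d_model)   # stand-in for real encoder activations
    with torch.no_grad():
        target, _ = teacher(x, x, x)           # the attention output to imitate
    pred = student(x.flatten(1)).view_as(target)
    loss = loss_fn(pred, target)
    opt.zero_grad(); loss.backward(); opt.step()
```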
I’m not impressed by that paper:
The initial point of our procedure is the training of the vanilla Transformer model, which consists of six encoders and six decoders. To reduce training times and make testing faster, we reduced the embedding size from 512 to 128. These changes did not drop the overall score too much below the original BLEU score but resulted in significantly lower computational power demands. This Transformer model was then used as a teacher model for training the feedforward networks.
...
We also adopted a fixed upper bound to sentence length, which we set to 50
Such short lengths make simple memorization work better, reducing architectural advantages.
Their Transformer baseline BLEU score for English-French translation is 0.276 vs 0.43 for NLLB-200. Yet their modified design results were strictly worse than their Transformer baselines.
It’s increasing architecture complexity.
there is no strong evidence against MLP-style approaches thus far
People have tried them. You just don’t get published unless you show progress.
Look at how everyone is still using BPEs, despite
You think you know something about tokenizers that OpenAI et al don’t, huh? Yes, current tokenizers have some problems, but I can tell you why they were used instead of something simpler: because the overall performance was better. Perhaps something like Meta’s MegaByte will replace them, but that’s not a design you’d suggested.
is whatever self-attention does mostly useful at small model sizes, and can be done by a more uniform large architecture?
I know what the self-attention does and the answer is “no”. I will not be posting an explanation until something close enough and not too obscure is published.
ViTs aren’t increased architecture complexity compared to what they replaced.
People have tried them. You just don’t get published unless you show progress.
I see.
You think you know something about tokenizers that OpenAI et al don’t, huh?
Yep. I know from talking to OAers that they did not know the consequences of choosing BPEs on things like rhyming or anagrams. Other people are ignorant too; even computer poetry people don’t know it, eg in April Cynthia Rudin’s comments on her old GPT poetry research show she still doesn’t understand why GPT-2 wouldn’t rhyme or why she’s also wrong when she claims ChatGPT can rhyme (or, for that matter, why the oddities of GPT poetry do not provide a good justification of the need for her interpretability work: because I didn’t need any interpretability research to figure out BPEs were the problem, her methods have not figured it out yet, and it’s far from obvious that any of her interpretability techniques would’ve diagnosed the problem given more work put into them).
And this is normal. Inventing something doesn’t mean you know everything about it. You shouldn’t be surprised that users of OA’s stuff learn things before OA; why would OA know as much about GPT poetry as I do? No one there researches GPT poetry. So it’s no surprise that they weren’t looking for reasons why GPT-2/3 couldn’t rhyme. More importantly, OA has been as surprised as anyone by things like inner-monologue. OA doesn’t know many things about its models, like the reason for the ‘unspeakable tokens’ or why DALL-E 2 couldn’t do anime*. Or consider how CLIP or ChatGPT took off. No, OA is certainly not omniscient. (Which is part of why OA has been talking about revisiting tokenization in order to better support foreign languages, which are harshly punished by current BPEs: even if you massively expand the BPE vocab to a million, you’re still handling synthetic languages poorly.)
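To make the rhyming/anagram point concrete, a quick illustration (assuming the `tiktoken` package, whose `gpt2` encoding is the BPE vocabulary used by GPT-2/GPT-3): the model only ever receives opaque token IDs, so words that rhyme or share spelling need not share any tokens, and the characters themselves are never visible to it.

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # BPE vocabulary used by GPT-2/GPT-3

for word in [" rough", " stuff", " enough", " unrhymable"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r:14} -> ids {ids}  pieces {pieces}")

# Whatever the exact splits, the point is the same: rhyming words map to
# unrelated integer IDs, and character-level structure (endings, letters,
# anagrams) is something the model must infer indirectly, never observe.
```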
Perhaps something like Meta’s MegaByte will replace them, but that’s not a design you’d suggested.
I couldn’t’ve suggested MegaByte because it’s not meaningfully different from many other local->global Transformer variants, other than tailoring it slightly to byte encoding, and those have all proven to be flawed on LRA and/or scale poorly (eg Performer in that Tay paper before). If I wanted to point to a good proof of concept, I’d point to ByT5 still and also the text2image work showing how the economically-valuable task of generating images with arbitrary text is sabotaged by non-byte encodings even at PaLM scale. Which Google didn’t know even though they invented the image models in question. :thinking_face:
Byte encoding is my favored encoding in the long run and the one I do expect to eventually take over, but I don’t know what the architecture is going to look like there, or whether byte-level tokenization will even require a ‘solution’ to giant context windows. Something else may win there, like a retrieval mechanism or going back to recurrence in a Perceiver-esque way.
Because the overall performance was better.
I don’t think they have ablated tokenizations, and I definitely do not think they have proper benchmarks which would even tell them the answer if they did.
I know what the self-attention does and the answer is “no”. I will not be posting an explanation until something close enough and not too obscure is published.
I see.
* DALL-E 2 anime was another case where OA insiders blew me off until I compiled enough cases that they had to admit there was something anomalous there, and they seem to have done a little bit about it, given an early upgrade and then DALL-E 2-exp. Unfortunately, they never confirmed whether my CLIP-censoring hypothesis was correct. On the bright side, at least the DALL-E 2 paper acknowledged the damage done by BPEs.
ViTs aren’t increased architecture complexity compared to what they replaced.
Sorry, I wasn’t clear enough, or maybe I misunderstood your position.
I saw you liked MLP-mixer type designs because they’re simpler, and per your tweets and comment above, you seem to think larger models should need less complexity.
I consider complex network designs for training and for inference to be the same type of complexity, and “textbooks are all you need” is then evidence for more structural complexity being helpful.
You clarified things a bit here: apparently the basis for your position is that “architectural complexity is generally just inductive bias that becomes less important as more-flexible capabilities increase”. I disagree with that model...or at least think you’re taking it too far, misapplying heuristics you developed from seeing eg AlphaZero beating hardcoded chess heuristics.
Depending on the details of your position, “textbooks are all you need” may or may not be evidence against it—do you consider such a data → evaluation → training structure “inductive bias obviated by scaling”?
Yep. I know from talking to OAers that they did not know the consequences of choosing BPEs on things like rhyming or anagrams.
Karpathy seemed to understand the issues. He went to OpenAI from Tesla somewhat recently, but I’d include Tesla in “OpenAI et al” and you were implying that all of those major AI labs (“everyone”) didn’t understand that.
I couldn’t’ve suggested MegaByte because it’s not meaningfully different from many other local->global Transformer variants
eg Performer in that Tay paper
This Tay paper? I see a bunch of sparse attention approaches in that survey, but I don’t see what MegaByte does in there. Maybe I missed an earlier paper, but the Performer design is completely different. MegaByte is certainly simpler than much of that stuff, but I guess people were looking in the wrong direction.
Byte encoding is my favored encoding in the long run
I do think byte encoding works well with MegaByte type designs—as people have already found in testing.
I don’t think they have ablated tokenizations, and I definitely do not think they have proper benchmarks which would even tell them the answer if they did.
I guess we disagree about that.
Speaking of MLPs and how supposedly they don’t scale & silently fail whenever anyone tries, this just dropped on Arxiv: “Scaling MLPs: A Tale of Inductive Bias”, Bachmann et al 2023 (excerpts):
In this work we revisit the most fundamental building block in deep learning, the multi-layer perceptron (MLP), and study the limits of its performance on vision tasks. Empirical insights into MLPs are important for multiple reasons. (1) Given the recent narrative “less inductive bias is better”, popularized due to transformers eclipsing convolutional models, it is natural to explore the limits of this hypothesis. To that end, MLPs offer an ideal test bed, being completely free of any inductive bias. (2) MLPs have almost exclusively been the main protagonist in the deep learning theory literature due to their mathematical simplicity, serving as a proxy to explain empirical phenomena observed for more complex architectures. Surprisingly, experimental datapoints for MLPs are very difficult to find in the literature, especially when coupled with large pre-training protocols. This discrepancy between practice and theory is worrying: Do MLPs reflect the empirical advances exhibited by practical models? Or do theorists need to rethink the role of MLPs as a proxy? We provide insights into both these aspects.
We show that the performance of MLPs drastically improves with scale (93% on CIFAR10, 79% on CIFAR100, 69% on TinyImageNet), highlighting that lack of inductive bias can indeed be compensated. We observe that MLPs mimic the behaviour of their modern counterparts faithfully, with some components in the learning setting however surprisingly exhibiting stronger or unexpected behaviours.
Due to their inherent computational efficiency, large pre-training experiments become more accessible for academic researchers. All of our experiments were run on a single GPU.
Much like the others, MLPs just need more regularization and/or data because of their greater capacity/less hardwired inductive bias. (By the way, note that they investigate Chinchilla scaling to see if 1:1 is optimal like for Transformers; it is not, and MLPs require more data per parameter because they are more powerful. Good news for MLPs...)
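As a concrete reading of ‘more data per parameter’, here is a toy compute-allocation calculation using the common approximation that training FLOPs ≈ 6 × parameters × tokens; the Transformer ratio is the familiar ~20 tokens/parameter, while the MLP ratio below is an invented placeholder just to show how the split shifts:

```python
import math

# Toy compute-optimal split under the approximation C ~= 6 * N * D,
# where N = parameters and D = training tokens.  If the optimal ratio is
# D = r * N, then N = sqrt(C / (6 * r)) and D = r * N.
def optimal_split(compute_flops: float, tokens_per_param: float):
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

C = 1e24  # arbitrary training budget in FLOPs

for arch, r in [("Transformer (~20 tokens/param, Chinchilla-ish)", 20),
                ("hypothetical MLP needing more data (say 60 tokens/param)", 60)]:
    n, d = optimal_split(C, r)
    print(f"{arch}: ~{n:.2e} params on ~{d:.2e} tokens")
```

The higher the tokens-per-parameter ratio, the more of a fixed budget goes into data rather than parameters, which is one way to read the ‘good news for MLPs’ remark above.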
I disagree with that model...or at least think you’re taking it too far, misapplying heuristics you developed from seeing eg AlphaZero beating hardcoded chess heuristics.
Oh no, there are many more examples than ‘just’ ViTs and MuZero (and machine translation). ‘The curves cross’ goes back at least to Highleyman in the 1960s with nearest-neighbors image classification, and you could add the original Breiman tabular ML movement in the 1990s which focused on beating logistic regression. (I would also highlight my prediction that, despite an almost unquestioned consensus post-diffusion-models asserting GANs had failed due to intrinsic and possibly unfixable flaws, they would do fine if scaled up.)
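‘The curves cross’ is easy to state concretely. A toy sketch with two made-up power-law scaling curves: the hand-engineered method starts ahead (better constant) but improves more slowly (worse exponent), so past some budget the more general method wins. Every number below is invented purely for illustration:

```python
# Two hypothetical scaling curves, loss = a * compute^(-b).
# "engineered" has the better constant, "general" the better exponent.
def loss(compute: float, a: float, b: float) -> float:
    return a * compute ** (-b)

engineered = dict(a=1.0, b=0.05)   # strong priors: good now, shallow slope
general    = dict(a=3.0, b=0.12)   # few priors: worse now, steeper slope

for c in [1e3, 1e6, 1e9, 1e12]:
    e, g = loss(c, **engineered), loss(c, **general)
    print(f"compute={c:.0e}  engineered={e:.3f}  general={g:.3f}  "
          f"ahead: {'engineered' if e < g else 'general'}")

# Closed-form crossover: a1*C^-b1 == a2*C^-b2  =>  C = (a2/a1)**(1/(b2-b1))
cross = (general["a"] / engineered["a"]) ** (1 / (general["b"] - engineered["b"]))
print(f"curves cross near compute ~ {cross:.1e}")
```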
and “textbooks are all you need” is then evidence for more structural complexity being helpful.
This isn’t a point in favor of architectures and ‘structural complexity’, but data. The question, of course, is to what extent it’s the right data and is not covertly building in (the way so many past failed methods do) expert hand-engineered priors/inductive-biases which earn buzz-clicks-cites right now but will ultimately hold back models compared to just ‘learn all the things!’ approaches to base models. It’s never really worked before, but people greet every new ‘our model beats GPT-3 with 1/100th the parameters’ research paper as the messiah—not that you remember any of those from 2020, or 2021, and 2022-2023 aren’t looking so hot either...
(It’s worth remembering that people, and especially academics, want it to be true that small cheap complicated models+data+algorithms, which require loving human expertise and curation and lots of published papers while costing little $, can solve all their problems; there are very few people who genuinely prefer large expensive simple models with hilariously-dumb architectures but which cost $$$. It is an acquired taste, to be sure. Like being kicked in the nuts. “Please sir, Mr Bitter Lesson, may I have another lesson that renders 2 more years of my life a complete waste of time?”)
do you consider such a data → evaluation → training structure “inductive bias obviated by scaling”?
Yes. Consider how an arch like MuZero turns compute into performance: it does so via data. Or consider what you spend extra compute on, as Bottou has been pointing out since the mid-2000s: you spend it on processing more data. SGD go brrrr.
Karpathy seemed to understand the issues.
You linked a 2023 tweet by Karpathy who has been reading me rant regularly about BPEs for at least 3 years now (note the first reply, BTW), and I’ve even discussed it with him in person! He didn’t think that before I did, so your example shows the opposite of what you take it to mean.
Maybe I missed an earlier paper, but the Performer design is completely different.
My point there is that all of the linear-attention Transformer variants seem to fail in some way and show bad scaling, like Performer does when explicitly investigated, and I expect the rest to also show bad scaling because none of them dominate Performer.
I do think byte encoding works well with MegaByte type designs—as people have already found in testing.
Byte encoding works well with non MegaByte type designs too...
this just dropped on Arxiv: “Scaling MLPs: A Tale of Inductive Bias”
The “baseline” comparisons in that paper are pretty funny; they messed with existing models in such a dumb way I think it was intentional. Anyway, a normal EfficientNet is ~2x as FLOP-efficient as their highlighted MLP example, it can reach higher accuracy more easily, and its advantage gets bigger with higher resolution than 64x64. Do you read the stuff you link to?
there are very few people who genuinely prefer large expensive simple models with hilariously-dumb architectures but which cost $$$
Maybe that’s true among engineers, but corporate directors and other people with lots of money definitely prefer that. And I also seem to see a decent number of people on twitter who feel some sense of superiority from “accepting the Bitter Lesson” harder than the next person. Then there are some people who just don’t want to think about design anymore and consider that an excuse to stop thinking about it...which I guess I’m fine with. Yeah, you do that.
You linked a 2023 tweet by Karpathy who has been reading me rant regularly about BPEs for at least 3 years now (note the first reply, BTW), and I’ve even discussed it with him in person!
I see; somehow I thought he was smarter than that. The few people at big AI labs I knew all understood those issues, but there’s obvious selection bias there.
My point there is that all of the linear-attention Transformer variants seem to fail in some way and show bad scaling, like Performer does when explicitly investigated
That is fundamentally different from what MegaByte does. It’s not a linear-attention scheme. I’m not sure why you see this as relevant.
Anyway, a normal EfficientNet is ~2x as FLOP-efficient as their highlighted MLP example, it can reach higher accuracy more easily, and its advantage gets bigger with higher resolution than 64x64. Do you read the stuff you link to?
You are again missing the point: as I already explained, we expect the constant to be worse. (If the constant was better in addition to a similar exponent, we would not be debating ‘will MLPs someday replace Transformers’, because it would already be happening.)
I will also point out that there is a very big difference between ‘it literally doesn’t work, everyone’s tried it and it doesn’t work, it doesn’t work so badly they won’t even publish anything I can cite to prove that MLPs are idiotic and have zero chance of ever going anywhere and and and -’ and ‘OK fine yeah they work and scale smoothly in the usual way but look this paper’s basic implementation is slower & lower accuracy at small scale than good baselines from NNs with more hardwired inductive biases zomg do you even read the stuff you link’. (That whoosh you hear is the much-abused goalposts moving at the speed of submitting a comment.)
Anyway, I’ve fleshed out one idea I have of how a scaled-up MLP architecture could surpass Transformers.
Maybe that’s true among engineers, but corporate directors and other people with lots of money definitely prefer that. And I also seem to see a decent number of people on twitter who feel some sense of superiority from “accepting the Bitter Lesson” harder than the next person.
You are seeing a selected sample. Don’t look at your social media feed, look at what people do. Look at the allocations of supercomputer time. Look at a single day of Arxiv papers (not the AKs, the actual site). There is not much research that takes it seriously or does research that will have long-term value, like running scaling law sweeps; it’s almost all stuff which relies on a fundamental denial of scaling & Bitter-Lesson-like logic, full of special-case tweaking or proving irrelevancies or creating a complicated architecture which saves a small constant-factor etc. Look at the Best Paper awards. Look at the surveys where the overwhelming majority deny scaling works even in principle (while also, interestingly, believing they are the brave minority, which they are not).
I see; somehow I thought he was smarter than that.
(Wow, way to just throw Karpathy under the bus.)
The few people at big AI labs I knew all understood those issues, but there’s obvious selection bias there.
I wonder if anyone other than Karpathy has been reading my observations about BPEs all these years...
That is fundamentally different from what MegaByte does. It’s not a linear-attention scheme. I’m not sure why you see this as relevant.
It’s not strictly linear in memory, but it’s doing the same thing of reducing the quadratic by some degree using a local->global hierarchy, as so many attention variants have done, and could reduce it to effectively linear (since stuff like n log n is effectively just n). So MegaByte is stuck: either it continues being mostly vanilla over fixed-size patches and eventually that sub-quadratic becomes expensive as the global sequence length & model must scale up, or it reduces the growth to linear by expanding the hierarchy further (adding in additional layers of ‘patches’) and just becomes another variant of local->global like Swin Transformer etc. (I thought they should’ve done a better job contextualizing it, in addition to benchmarking on LRA.) Maybe MegaByte somehow threads the needle just right with its particular set of tweaks, and will be the attention variant which finally cracks the nut, but I’m not too optimistic. /shrug. What would get my attention is wins on LRA benchmarking or exponents in scaling-law sweeps like Tay’s. We’ll see.
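To put the complexity argument in numbers, a rough operation-count sketch (bookkeeping only, not a claim about MegaByte’s exact design): full self-attention pays n² pairwise scores; one level of local->global hierarchy over patches of size P pays roughly n·P locally plus (n/P)² globally, which is sub-quadratic but still super-linear unless you keep adding levels:

```python
# Back-of-the-envelope attention-score counts for context length n,
# ignoring constants, head counts, and the d_model factor.
def full_attention(n: int) -> int:
    return n ** 2                        # every token attends to every token

def local_global(n: int, patch: int) -> int:
    local = n * patch                    # each token attends within its patch
    global_ = (n // patch) ** 2          # patch summaries attend to each other
    return local + global_

for n in [8_192, 262_144, 8_388_608]:    # ~8K, ~256K, ~8M byte contexts
    p = int(round(n ** 0.5))             # patch size balancing the two terms
    print(f"n={n:>9,}: full={full_attention(n):.2e}  "
          f"one-level hierarchy={local_global(n, p):.2e}  (~n^1.5)")
```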
I know what the self-attention does and the answer is “no”. I will not be posting an explanation until something close enough and not too obscure is published.
Have fun playing with MLPs. I’m not trying to stop you, I’m just stating my position for audience members who understand it.
Might be good to post a hashed claim.