ViTs aren’t increased architecture complexity compared to what they replaced.
People have tried them. You just don’t get published unless you show progress.
I see.
You think you know something about tokenizers that OpenAI et al don’t, huh?
Yep. I know from talking to OAers that they did not know the consequences of choosing BPEs on things like rhyming or anagrams. Other people are ignorant too; even computer poetry people don’t know it, eg in April, Cynthia Rudin’s comments on her old GPT poetry research show she still doesn’t understand why GPT-2 wouldn’t rhyme or why she’s also wrong when she claims ChatGPT can rhyme (or, for that matter, why the oddities of GPT poetry do not provide a good justification of the need for her interpretability work: because I didn’t need any interpretability research to figure out BPEs were the problem, her methods have not figured it out yet, and it’s far from obvious that any of her interpretability techniques would’ve diagnosed the problem given more work put into them).
And this is normal. Inventing something doesn’t mean you know everything about it. You shouldn’t be surprised that users of OA’s stuff learn things before OA; why would OA know as much about GPT poetry as I do? No one there researches GPT poetry. So it’s no surprise that they weren’t looking for reasons why GPT-2/3 couldn’t rhyme. More important than that, OA has been as surprised as anyone by things like inner-monologue. OA doesn’t know many things about its models, like the reason for the ‘unspeakable tokens’ or why DALL-E 2 couldn’t do anime*. Or consider how CLIP or ChatGPT took off. No, OA is certainly not omniscient. (Which is part of why OA has been talking about revisiting tokenization in order to better support foreign languages, which are harshly punished by current BPEs: even if you massively expand the BPE vocab to a million, you’re still handling synthetic languages poorly.)
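A toy illustration of the BPE point above, as a sketch under stated assumptions: the vocabulary `TOY_VOCAB` and the greedy longest-match `toy_subword_encode` are hypothetical stand-ins, not OpenAI's actual BPE merges or tokenizer. The only point is that once a word becomes a single opaque ID, shared spelling (used here as a crude proxy for rhyme) is no longer visible in the model's input, whereas the UTF-8 bytes keep it.

```python
# Hypothetical toy vocabulary: whole words get arbitrary IDs, as a
# stand-in for a learned subword vocabulary (NOT a real BPE merge table).
TOY_VOCAB = {"night": 101, "light": 2057, "n": 1, "l": 2, "ight": 3}


def toy_subword_encode(word: str) -> list[int]:
    """Greedy longest-match segmentation over TOY_VOCAB (an assumption;
    real BPE applies learned merges in order instead)."""
    ids, i = [], 0
    while i < len(word):
        # Try the longest remaining substring first.
        for j in range(len(word), i, -1):
            if word[i:j] in TOY_VOCAB:
                ids.append(TOY_VOCAB[word[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token covers {word[i:]!r}")
    return ids


def byte_encode(word: str) -> list[int]:
    """Byte-level encoding: one ID per UTF-8 byte."""
    return list(word.encode("utf-8"))


def shared_suffix_len(a: list[int], b: list[int]) -> int:
    """Length of the common trailing run of IDs in two sequences."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


subword = shared_suffix_len(toy_subword_encode("night"), toy_subword_encode("light"))
byte_level = shared_suffix_len(byte_encode("night"), byte_encode("light"))
print(subword, byte_level)
```

In this toy setup the subword IDs for "night" and "light" share nothing, while the byte sequences share their last four bytes. A real tokenizer will segment differently, so treat this as an illustration of the mechanism, not a measurement.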
Perhaps something like Meta’s MegaByte will replace them, but that’s not a design you’d suggested.
I couldn’t’ve suggested MegaByte because it’s not meaningfully different from many other local->global Transformer variants, other than tailoring it slightly to byte encoding, and those have all proven to be flawed on LRA and/or scale poorly (eg Performer in that Tay paper before). If I wanted to point to a good proof of concept, I’d point to ByT5 still and also the text2image work showing how the economically-valuable task of generating images with arbitrary text is sabotaged by non-byte encodings even at PaLM scale. Which Google didn’t know even though they invented the image models in question. :thinking_face:
Byte encoding is my favored encoding in the long run and which I do expect to eventually take over, but I don’t know what the architecture is going to look like there or if byte-level tokenization will even require a ‘solution’ to giant context windows. Something else may win there, like a retrieval mechanism or going back to recurrency in a Perceiver-esque way.
Because the overall performance was better.
I don’t think they have ablated tokenizations, and I definitely do not think they have proper benchmarks which would even tell them the answer if they did.
I know what the self-attention does and the answer is “no”. I will not be posting an explanation until something close enough and not too obscure is published.
I see.
* DALL-E 2 anime was another case where OA insiders blew me off until I compiled enough cases that they had to admit there was something anomalous there; they seem to have done a little bit about it, given an early upgrade and then DALL-E-2-exp. Unfortunately, they never confirmed whether my CLIP-censoring hypothesis was correct. On the bright side, at least the DALL-E 2 paper acknowledged the damage done by BPEs.
ViTs aren’t increased architecture complexity compared to what they replaced.
Sorry, I wasn’t clear enough, or maybe I misunderstood your position.
I saw you liked MLP-mixer type designs because they’re simpler, and per your tweets and comment above, you seem to think larger models should need less complexity.
I consider complex network designs for training and for inference to be the same type of complexity, and “textbooks are all you need” is then evidence for more structural complexity being helpful.
You clarified things a bit here: apparently the basis for your position is that “architectural complexity is generally just inductive bias that becomes less important as more-flexible capabilities increase”. I disagree with that model...or at least think you’re taking it too far, misapplying heuristics you developed from seeing eg AlphaZero beating hardcoded chess heuristics.
Depending on the details of your position, “textbooks are all you need” may or may not be evidence against it—do you consider such a data → evaluation → training structure “inductive bias obviated by scaling”?
Yep. I know from talking to OAers that they did not know the consequences of choosing BPEs on things like rhyming or anagrams.
Karpathy seemed to understand the issues. He went to OpenAI from Tesla somewhat recently, but I’d include Tesla in “OpenAI et al” and you were implying that all of those major AI labs (“everyone”) didn’t understand that.
I couldn’t’ve suggested MegaByte because it’s not meaningfully different from many other local->global Transformer variants
eg Performer in that Tay paper
This Tay paper? I see a bunch of sparse attention approaches in that survey, but I don’t see what MegaByte does in there. Maybe I missed an earlier paper, but the Performer design is completely different. MegaByte is certainly simpler than much of that stuff, but I guess people were looking in the wrong direction.
Byte encoding is my favored encoding in the long run
I do think byte encoding works well with MegaByte type designs—as people have already found in testing.
I don’t think they have ablated tokenizations, and I definitely do not think they have proper benchmarks which would even tell them the answer if they did.
In this work we revisit the most fundamental building block in deep learning, the multi-layer perceptron (MLP), and study the limits of its performance on vision tasks. Empirical insights into MLPs are important for multiple reasons. (1) Given the recent narrative “less inductive bias is better”, popularized due to transformers eclipsing convolutional models, it is natural to explore the limits of this hypothesis. To that end, MLPs offer an ideal test bed, being completely free of any inductive bias. (2) MLPs have almost exclusively been the main protagonist in the deep learning theory literature due to their mathematical simplicity, serving as a proxy to explain empirical phenomena observed for more complex architectures. Surprisingly, experimental datapoints for MLPs are very difficult to find in the literature, especially when coupled with large pre-training protocols. This discrepancy between practice and theory is worrying: Do MLPs reflect the empirical advances exhibited by practical models? Or do theorists need to rethink the role of MLPs as a proxy? We provide insights into both these aspects.
We show that the performance of MLPs drastically improves with scale (93% on CIFAR10, 79% on CIFAR100, 69% on TinyImageNet), highlighting that lack of inductive bias can indeed be compensated. We observe that MLPs mimic the behaviour of their modern counterparts faithfully, with some components in the learning setting however surprisingly exhibiting stronger or unexpected behaviours.
Due to their inherent computational efficiency, large pre-training experiments become more accessible for academic researchers. All of our experiments were run on a single GPU.
Much like the others, MLPs just need more regularization and/or data because of their greater capacity/less hardwired inductive bias. (By the way, note that they investigate Chinchilla scaling to see if 1:1 is optimal like for Transformers; it is not, and MLPs require more data per parameter because they are more powerful. Good news for MLPs...)
I disagree with that model...or at least think you’re taking it too far, misapplying heuristics you developed from seeing eg AlphaZero beating hardcoded chess heuristics.
Oh no, there are many more examples than ‘just’ ViTs and MuZero (and machine translation). ‘The curves cross’ goes back at least to Highleyman in the 1960s with nearest-neighbors image classification, and you could add the original Breiman tabular ML movement in the 1990s which focused on beating logistic regression. (I would also highlight my prediction that despite an almost unquestioned consensus post-diffusion models asserting GANs had failed due to intrinsic and possibly unfixable flaws, that they would do fine if scaled up.)
and “textbooks are all you need” is then evidence for more structural complexity being helpful.
This isn’t a point in favor of architectures and ‘structural complexity’, but data. The question, of course, is to what extent it’s the right data and is not building in covertly (the way so many past failed methods do) expert hand-engineered priors/inductive-biases which earns buzz-clicks-cites right now but will ultimately hold back models compared to just ‘learn all the things!’ approaches to base models. It’s never really worked before, but people greet every new ‘our model beats GPT-3 with 1/100th the parameters’ research paper as the messiah—not that you remember any of those from 2020, or 2021, and 2022-2023 aren’t looking so hot either...
(It’s worth remembering that people, and especially academics, want it to be true that small cheap complicated models+data+algorithms, which require loving human expertise and curation and lots of published papers while costing little $, can solve all their problems; there are very few people who genuinely prefer large expensive simple models with hilariously-dumb architectures but which cost $$$. It is an acquired taste, to be sure. Like being kicked in the nuts. “Please sir, Mr Bitter Lesson, may I have another lesson that renders 2 more years of my life a complete waste of time?”)
do you consider such a data → evaluation → training structure “inductive bias obviated by scaling”?
Yes. Consider how an arch like MuZero turns compute into performance: it does so via data. Or consider what you spend extra compute on, as Bottou has been pointing out since the mid-2000s: you spend it on processing more data. SGD go brrrr.
Karpathy seemed to understand the issues.
You linked a 2023 tweet by Karpathy who has been reading me rant regularly about BPEs for at least 3 years now (note the first reply, BTW), and I’ve even discussed it with him in person! He didn’t think that before I did, so your example shows the opposite of what you take it to mean.
Maybe I missed an earlier paper, but the Performer design is completely different.
My point there is that all of the linear-attention Transformer variants seem to fail in some way and show bad scaling, like Performer does when explicitly investigated, and I expect the rest to also show bad scaling because none of them dominate Performer.
I do think byte encoding works well with MegaByte type designs—as people have already found in testing.
Byte encoding works well with non-MegaByte type designs too...
this just dropped on Arxiv: “Scaling MLPs: A Tale of Inductive Bias”
The “baseline” comparisons in that paper are pretty funny; they messed with existing models in such a dumb way I think it was intentional. Anyway, a normal EfficientNet is ~2x as FLOP-efficient as their highlighted MLP example, it can reach higher accuracy more easily, and its advantage gets bigger with higher resolution than 64x64. Do you read the stuff you link to?
there are very few people who genuinely prefer large expensive simple models with hilariously-dumb architectures but which cost $$$
Maybe that’s true among engineers, but corporate directors and other people with lots of money definitely prefer that. And I also seem to see a decent number of people on twitter who feel some sense of superiority from “accepting the Bitter Lesson” harder than the next person. Then there are some people who just don’t want to think about design anymore and consider that an excuse to stop thinking about it...which I guess I’m fine with. Yeah, you do that.
You linked a 2023 tweet by Karpathy who has been reading me rant regularly about BPEs for at least 3 years now (note the first reply, BTW), and I’ve even discussed it with him in person!
I see; somehow I thought he was smarter than that. The few people at big AI labs I knew all understood those issues, but there’s obvious selection bias there.
My point there is that all of the linear-attention Transformer variants seem to fail in some way and show bad scaling, like Performer does when explicitly investigated
That is fundamentally different from what MegaByte does. It’s not a linear-attention scheme. I’m not sure why you see this as relevant.
Anyway, a normal EfficientNet is ~2x as FLOP-efficient as their highlighted MLP example, it can reach higher accuracy more easily, and its advantage gets bigger with higher resolution than 64x64. Do you read the stuff you link to?
You are again missing the point: as I already explained, we expect the constant to be worse. (If the constant was better in addition to a similar exponent, we would not be debating ‘will MLPs someday replace Transformers’, because it would already be happening.)
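The constant-versus-exponent distinction can be sketched with two hypothetical power-law loss curves of the form loss = a * C^(-b), where C is compute. All numbers below are made up for illustration and are not fitted to the Bachmann et al results or to any other paper; `loss` and `crossover_compute` are hypothetical helpers.

```python
def loss(a: float, b: float, compute: float) -> float:
    """Power-law loss curve: a * compute^(-b)."""
    return a * compute ** (-b)


def crossover_compute(a1: float, b1: float, a2: float, b2: float) -> float:
    """Compute at which a1*C^(-b1) == a2*C^(-b2).

    Requires b1 != b2: with equal exponents the curves are parallel
    in log-log space and never cross."""
    if b1 == b2:
        raise ValueError("equal exponents: curves never cross")
    return (a1 / a2) ** (1.0 / (b1 - b2))


# Made-up numbers: the "biased" architecture has the better constant,
# the "flexible" one has the slightly better exponent.
biased = (10.0, 0.30)
flexible = (20.0, 0.35)

c_star = crossover_compute(*flexible, *biased)
print(f"curves cross near C = {c_star:.3g}")

# Below the crossover the biased model wins; above it the flexible one does.
print(loss(*biased, c_star / 10) < loss(*flexible, c_star / 10))
print(loss(*flexible, c_star * 10) < loss(*biased, c_star * 10))
```

Under these assumed numbers, a small-scale comparison would show the flexible model losing, even though it wins past the crossover. Whether real architectures behave like this is exactly what a scaling sweep would have to establish.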
I will also point out that there is a very big difference between ‘it literally doesn’t work, everyone’s tried it and it doesn’t work, it doesn’t work so badly they won’t even publish anything I can cite to prove that MLPs are idiotic and have zero chance of ever going anywhere and and and -’ and ‘OK fine yeah they work and scale smoothly in the usual way but look this paper’s basic implementation is slower & lower accuracy at small scale than good baselines from NNs with more hardwired inductive biases zomg do you even read the stuff you link’. (That whoosh you hear is the much-abused goalposts moving at the speed of submitting a comment.)
Anyway, I’ve fleshed out one idea I have of how a scaled-up MLP architecture could surpass Transformers.
Maybe that’s true among engineers, but corporate directors and other people with lots of money definitely prefer that. And I also seem to see a decent number of people on twitter who feel some sense of superiority from “accepting the Bitter Lesson” harder than the next person.
You are seeing a selected sample. Don’t look at your social media feed, look at what people do. Look at the allocations of supercomputer time. Look at a single day of Arxiv papers (not the AKs, the actual site). There is not much research that takes it seriously or does research that will have long-term value, like running scaling law sweeps; it’s almost all stuff which relies on a fundamental denial of scaling & Bitter-Lesson-like logic, full of special-case tweaking or proving irrelevancies or creating a complicated architecture which saves a small constant-factor etc. Look at the Best Paper awards. Look at the surveys where the overwhelming majority deny scaling works even in principle (while also, interestingly, believing they are the brave minority, which they are not).
I see; somehow I thought he was smarter than that.
(Wow, way to just throw Karpathy under the bus.)
The few people at big AI labs I knew all understood those issues, but there’s obvious selection bias there.
I wonder if anyone other than Karpathy has been reading my observations about BPEs all these years...
That is fundamentally different from what MegaByte does. It’s not a linear-attention scheme. I’m not sure why you see this as relevant.
It’s not strictly linear in memory, but it’s doing the same thing of reducing the quadratic by some degree using a local->global hierarchy, as so many attention variants have done, and could reduce it to effectively linear (since stuff like nlogn are effectively just n). So MegaByte is stuck: either it continues being mostly vanilla over fixed-sized patches and eventually that sub-quadratic becomes expensive as the global sequence length & model must scale up, or it reduces the growth to linear by expanding the hierarchy further (adding in additional layers of ‘patches’) and just becomes another variant of local->global like Swin Transformer etc. (I thought they should’ve done a better job contextualizing it, in addition to benchmarking on LRA.) Maybe MegaByte somehow threads the needle just right with its particular set of tweaks, and will be the attention variant which finally cracks the nut, but I’m not too optimistic. /shrug. What would get my attention is wins on LRA benchmarking or exponents in scaling law sweeps like Tay. We’ll see.
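To make the trade-off concrete, here is a back-of-envelope cost model, offered as a sketch rather than MegaByte's actual accounting. It counts only pairwise attention interactions and ignores constants and feedforward cost. The function names are hypothetical. With sequence length n and patch size P, the global model attends over n/P patches and each of the n/P local models attends within its P bytes.

```python
def vanilla_attention_cost(n: int) -> int:
    """Pairwise interactions for full self-attention over n positions."""
    return n * n


def two_level_cost(n: int, patch: int) -> int:
    """Global attention over n/patch patches plus local attention inside
    each patch. Assumes patch divides n; constants are ignored."""
    num_patches = n // patch
    global_cost = num_patches * num_patches
    local_cost = num_patches * patch * patch
    return global_cost + local_cost


# Sequence lengths paired with a patch size equal to n^(1/3), which
# balances the global and local terms in this simplified model.
for n, cube_root_patch in ((2**12, 16), (2**18, 64), (2**24, 256)):
    fixed = two_level_cost(n, patch=8)
    grown = two_level_cost(n, patch=cube_root_patch)
    print(n, vanilla_attention_cost(n), fixed, grown)
```

In this model a fixed patch size leaves the global term (n/P)^2 quadratic in n, just with a smaller constant, while letting P grow like n^(1/3) brings the total down to roughly n^(4/3). Getting closer to linear would need further levels of hierarchy, which is the second horn of the dilemma described above.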
I guess we disagree about that.
Speaking of MLPs and how supposedly they don’t scale & silently fail whenever anyone tries, this just dropped on Arxiv: “Scaling MLPs: A Tale of Inductive Bias”, Bachmann et al 2023 (excerpts)
Have fun playing with MLPs. I’m not trying to stop you, I’m just stating my position for audience members who understand it.