This just dropped on arXiv: “Scaling MLPs: A Tale of Inductive Bias”
The “baseline” comparisons in that paper are pretty funny; they messed with the existing models in such a dumb way that I think it was intentional. Anyway, a normal EfficientNet is ~2x as FLOP-efficient as their highlighted MLP example, it can reach higher accuracy more easily, and its advantage only grows at resolutions above 64x64. Do you read the stuff you link to?
there are very few people who genuinely prefer large expensive simple models with hilariously-dumb architectures but which cost $$$
Maybe that’s true among engineers, but corporate directors and other people with lots of money definitely prefer that. And I also seem to see a decent number of people on twitter who feel some sense of superiority from “accepting the Bitter Lesson” harder than the next person. Then there are some people who just don’t want to think about design anymore and treat the Bitter Lesson as an excuse to stop...which I guess I’m fine with. Yeah, you do that.
You linked a 2023 tweet by Karpathy, who has been reading me rant regularly about BPEs for at least 3 years now (note the first reply, BTW), and I’ve even discussed it with him in person!
I see; somehow I thought he was smarter than that. The few people at big AI labs I knew all understood those issues, but there’s obvious selection bias there.
My point there is that all of the linear-attention Transformer variants seem to fail in some way and show bad scaling, like Performer does when explicitly investigated.
That is fundamentally different from what MegaByte does. It’s not a linear-attention scheme. I’m not sure why you see this as relevant.
Anyway, a normal EfficientNet is ~2x as FLOP-efficient as their highlighted MLP example, it can reach higher accuracy more easily, and its advantage only grows at resolutions above 64x64. Do you read the stuff you link to?
You are again missing the point: as I already explained, we expect the constant to be worse. (If the constant was better in addition to a similar exponent, we would not be debating ‘will MLPs someday replace Transformers’, because it would already be happening.)
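To make the constant-vs-exponent point concrete, here’s a toy sketch of compute-scaling curves; every coefficient below is made up purely for illustration and is not fit to the MLP paper or to any real Transformer scaling law.

```python
# Toy compute-scaling curves, loss(C) = a * C**(-b). All numbers are made up
# to illustrate the constant-vs-exponent argument; nothing here is fit to data.

def loss(C, a, b):
    return a * C ** (-b)

transformer = dict(a=1.0, b=0.070)   # hypothetical: better constant
mlp         = dict(a=1.6, b=0.085)   # hypothetical: worse constant, better exponent

# With a similar exponent and a worse constant, the MLP trails by a fixed
# compute multiple forever. With a better exponent, the curves cross at:
#   a_m * C**(-b_m) = a_t * C**(-b_t)  =>  C* = (a_m / a_t)**(1 / (b_m - b_t))
crossover = (mlp['a'] / transformer['a']) ** (1 / (mlp['b'] - transformer['b']))
print(f"crossover compute ~ {crossover:.1e}")

for C in (1e10, 1e13, 1e16):
    print(f"C={C:.0e}: transformer={loss(C, **transformer):.3f}  mlp={loss(C, **mlp):.3f}")
```

The point is only that the replacement question turns on the exponent; a worse constant just sets how far out the crossover is (or, with equal exponents, means there is no crossover at all).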
I will also point out that there is a very big difference between ‘it literally doesn’t work, everyone’s tried it and it doesn’t work, it doesn’t work so badly they won’t even publish anything I can cite to prove that MLPs are idiotic and have zero chance of ever going anywhere and and and -’ and ‘OK fine yeah they work and scale smoothly in the usual way but look this paper’s basic implementation is slower & lower accuracy at small scale than good baselines from NNs with more hardwired inductive biases zomg do you even read the stuff you link’. (That whoosh you hear is the much-abused goalposts moving at the speed of submitting a comment.)
Anyway, I’ve fleshed out one idea I have of how a scaled-up MLP architecture could surpass Transformers.
Maybe that’s true among engineers, but corporate directors and other people with lots of money definitely prefer that. And I also seem to see a decent number of people on twitter who feel some sense of superiority from “accepting the Bitter Lesson” harder than the next person.
You are seeing a selected sample. Don’t look at your social media feed; look at what people do. Look at the allocations of supercomputer time. Look at a single day of arXiv papers (not the AKs, the actual site). There is not much research that takes it seriously or that will have long-term value, like running scaling-law sweeps; it’s almost all stuff which relies on a fundamental denial of scaling & Bitter-Lesson-like logic, full of special-case tweaking, proving irrelevancies, or creating complicated architectures which save a small constant factor, etc. Look at the Best Paper awards. Look at the surveys where the overwhelming majority deny scaling works even in principle (while also, interestingly, believing they are the brave minority, which they are not).
I see; somehow I thought he was smarter than that.
(Wow, way to just throw Karpathy under the bus.)
The few people at big AI labs I knew all understood those issues, but there’s obvious selection bias there.
I wonder if anyone other than Karpathy has been reading my observations about BPEs all these years...
That is fundamentally different from what MegaByte does. It’s not a linear-attention scheme. I’m not sure why you see this as relevant.
It’s not strictly linear in memory, but it’s doing the same thing of reducing the quadratic by some degree using a local->global hierarchy, as so many attention variants have done, and it could reduce it to effectively linear (since stuff like n log n is effectively just n). So MegaByte is stuck: either it stays mostly vanilla over fixed-size patches, and eventually that sub-quadratic cost becomes expensive as the global sequence length & model scale up, or it reduces the growth to linear by expanding the hierarchy further (adding additional layers of ‘patches’) and just becomes another variant of local->global like Swin Transformer etc. (I thought they should’ve done a better job contextualizing it, in addition to benchmarking on LRA.) Maybe MegaByte somehow threads the needle just right with its particular set of tweaks and will be the attention variant which finally cracks the nut, but I’m not too optimistic. /shrug. What would get my attention is wins on LRA benchmarking, or exponents in scaling-law sweeps like Tay’s. We’ll see.
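As a rough back-of-the-envelope on why the sub-quadratic cost still bites eventually: the two-level cost model and the token counts below are my own assumptions, not MegaByte’s exact accounting.

```python
# Back-of-envelope attention pair counts, ignoring width, heads, and constants.
# Assumed cost model for a two-level patch hierarchy (my assumption, not
# MegaByte's exact accounting): global attention over the n/P patch embeddings
# costs (n/P)**2 pairs, and local attention inside each patch costs n*P pairs.

def vanilla(n):
    return n ** 2

def two_level(n, P):
    return (n / P) ** 2 + n * P

for n in (2**14, 2**17, 2**20, 2**23):
    P = round(n ** (1 / 3))   # roughly the cost-minimizing patch size
    cost = two_level(n, P)
    print(f"n={n:>8}  vanilla={vanilla(n):.2e}  two-level={cost:.2e}  "
          f"two-level/n^(4/3)={cost / n ** (4 / 3):.2f}")
```

With a fixed two-level hierarchy the best you can do is roughly n^(4/3): sub-quadratic, but still super-linear, so it eventually gets expensive; pushing toward linear means adding more levels, at which point it is just another local->global scheme.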
Have fun playing with MLPs. I’m not trying to stop you; I’m just stating my position for audience members who understand it.