Aren’t attention networks and MLPs both subsets of feedforward networks already? What you really mean is “Attention can implement fully-connected MLPs”?
Calling fully-connected MLPs “feedforward networks” is common (e.g. in the original transformer paper https://arxiv.org/pdf/1706.03762.pdf), so I tried to use that language here for the sake of the transformer-background people. But yes, I think “Attention can implement fully-connected MLPs” is a correct and arguably more accurate way to describe this.
Given the general contempt that MLPs are held in at present, and the extent to which people seem to regard self-attention as magic pixie dust which cannot be replicated by alternatives like CNNs or MLPs and which makes Transformers qualitatively different from anything before & solely responsible for the past ~4 years of DL progress (earlier discussion defending MLP prospects), it might be more useful to emphasize the other direction: if you can convert any self-attention to an equivalent fully-connected MLP, then that can be described as “there is a fully-connected MLP that implements your self-attention”. (Incidentally, maybe I missed this in the writeup, but this post is only providing an injective self-attention → MLP construction, right? Not the other way around, so converting an arbitrary MLP layer to a self-attention layer is presumably doable—at least with enough parameters—but remains unknown.)
Unfortunate that the construction is so inefficient: 12 heads → 3,000 heads or 250x inflation is big enough to be practically irrelevant (maybe theoretically too). I wonder if you can tighten that to something much more relevant? My intuition is that MLPs are such powerful function approximators that you should be able to convert between much more similar-sized nets (and maybe smaller MLPs).
In either direction—perhaps you could just directly empirically approximate an exchange rate by training MLPs of various sizes to distill a self-attention layer? Given the sloppiness in attention patterns, it wouldn’t necessarily have to be all that accurate. And you could do this for each layer to de-attend a NN, which ought to have nice performance characteristics in addition to being a PoC.
(My prediction would be that the parameter-optimal MLP equivalent would have a width vs depth scaling law such that increasing large Transformer heads would be approximated by increasingly skinny deep MLP stacks, to allow switching/mixing by depth. And that you could probably come up with an initialization for the MLPs which makes them start off with self-attention-like activity, like you can come up with Transformer initializations that mimic CNN inductive priors. Then you could just drop the distillation entirely and create a MLPized Transformer from scratch.)
Incidentally, maybe I missed this in the writeup, but this post is only providing an injective self-attention → MLP construction, right?
Either I’m misunderstanding you or you’re misunderstanding me, but I think I’ve shown the opposite: any MLP layer can be converted to a self-attention layer. (Well, in this post I actually show how to convert the MLP layer to 3 self-attention layers, but in my follow-up I show how you can get it in one.) I don’t claim that you can do a self-attention → MLP construction.
Converting an arbitrary MLP layer to a self-attention layer is presumably doable—at least with enough parameters—but remains unknown
This is what I think I show here! Let the unknown be known!
Unfortunate that the construction is so inefficient: 12 heads → 3,000 heads or 250x inflation is big enough to be practically irrelevant (maybe theoretically too).
Yes, this is definitely at an “interesting trivia” level of efficiency. Unfortunately, the construction is built around using 1 attention head per hidden dimension, so I don’t see any obvious way to improve the number of heads. The only angle I have for this to be useful at current scale is that Anthropic (paraphrased) said “oh we can do interpretability on attention heads but not MLPs”, so the conversion of the later into the former might supplement their techniques.
Yes, you’re right. My bad; I was skimming in a hurry before heading out while focused on my own hobbyhorse of ‘how to make MLPs beat Transformers?’. Knew I was missing something, so glad I checked. Now that you put it that way, the intuition is a lot clearer, and shrinking it seems a lot harder: one head per hidden dim/neuron is a straightforward construction but also unclear how much you could be guaranteed to shrink it by trying to merge heads...
The empirical approach, in both directions, might be the best bet here, and has the advantage of being the sort of thing that someone junior could get interesting results on quickly with minimal hardware.
Aren’t attention networks and MLPs both subsets of feedforward networks already? What you really mean is “Attention can implement fully-connected MLPs”?
Calling fully-connected MLPs “feedforward networks” is common (e.g. in the original transformer paper https://arxiv.org/pdf/1706.03762.pdf), so I tried to use that language here for the sake of the transformer-background people. But yes, I think “Attention can implement fully-connected MLPs” is a correct and arguably more accurate way to describe this.
Given the general contempt that MLPs are held in at present, and the extent to which people seem to regard self-attention as magic pixie dust which cannot be replicated by alternatives like CNNs or MLPs and which makes Transformers qualitatively different from anything before & solely responsible for the past ~4 years of DL progress (earlier discussion defending MLP prospects),
it might be more useful to emphasize the other direction: if you can convert any self-attention to an equivalent fully-connected MLP, then that can be described as “there is a fully-connected MLP that implements your self-attention”. (Incidentally, maybe I missed this in the writeup, but this post is only providing an injective self-attention → MLP construction, right? Not the other way around, so converting an arbitrary MLP layer to a self-attention layer is presumably doable—at least with enough parameters—but remains unknown.)Unfortunate that the construction is so inefficient: 12 heads → 3,000 heads or 250x inflation is big enough to be practically irrelevant (maybe theoretically too). I wonder if you can tighten that to something much more relevant?
My intuition is that MLPs are such powerful function approximators that you should be able to convert between much more similar-sized nets (and maybe smaller MLPs).In either direction—perhaps you could just directly empirically approximate an exchange rate by training MLPs of various sizes to distill a self-attention layer? Given the sloppiness in attention patterns, it wouldn’t necessarily have to be all that accurate. And you could do this for each layer to de-attend a NN, which ought to have nice performance characteristics in addition to being a PoC.
(My prediction would be that the parameter-optimal MLP equivalent would have a width vs depth scaling law such that increasing large Transformer heads would be approximated by increasingly skinny deep MLP stacks, to allow switching/mixing by depth. And that you could probably come up with an initialization for the MLPs which makes them start off with self-attention-like activity, like you can come up with Transformer initializations that mimic CNN inductive priors. Then you could just drop the distillation entirely and create a MLPized Transformer from scratch.)
Either I’m misunderstanding you or you’re misunderstanding me, but I think I’ve shown the opposite: any MLP layer can be converted to a self-attention layer. (Well, in this post I actually show how to convert the MLP layer to 3 self-attention layers, but in my follow-up I show how you can get it in one.) I don’t claim that you can do a self-attention → MLP construction.
This is what I think I show here! Let the unknown be known!
Yes, this is definitely at an “interesting trivia” level of efficiency. Unfortunately, the construction is built around using 1 attention head per hidden dimension, so I don’t see any obvious way to improve the number of heads. The only angle I have for this to be useful at current scale is that Anthropic (paraphrased) said “oh we can do interpretability on attention heads but not MLPs”, so the conversion of the later into the former might supplement their techniques.
Yes, you’re right. My bad; I was skimming in a hurry before heading out while focused on my own hobbyhorse of ‘how to make MLPs beat Transformers?’. Knew I was missing something, so glad I checked. Now that you put it that way, the intuition is a lot clearer, and shrinking it seems a lot harder: one head per hidden dim/neuron is a straightforward construction but also unclear how much you could be guaranteed to shrink it by trying to merge heads...
The empirical approach, in both directions, might be the best bet here, and has the advantage of being the sort of thing that someone junior could get interesting results on quickly with minimal hardware.