Incidentally, maybe I missed this in the writeup, but this post is only providing an injective self-attention → MLP construction, right?
Either I’m misunderstanding you or you’re misunderstanding me, but I think I’ve shown the opposite: any MLP layer can be converted to a self-attention layer. (Well, in this post I actually show how to convert the MLP layer to 3 self-attention layers, but in my follow-up I show how you can get it in one.) I don’t claim that you can do a self-attention → MLP construction.
Converting an arbitrary MLP layer to a self-attention layer is presumably doable—at least with enough parameters—but remains unknown.
This is what I think I show here! Let the unknown be known!
Unfortunate that the construction is so inefficient: 12 heads → 3,000 heads, a ~250x inflation, is big enough to make it practically irrelevant (and maybe theoretically too).
Yes, this is definitely at an “interesting trivia” level of efficiency. Unfortunately, the construction is built around using 1 attention head per hidden dimension, so I don’t see any obvious way to reduce the number of heads. The only angle I have for this to be useful at current scale is that Anthropic (paraphrased) said “oh we can do interpretability on attention heads but not MLPs”, so converting the latter into the former might supplement their techniques.
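For concreteness, here is the back-of-the-envelope arithmetic behind those numbers as a tiny Python snippet. The GPT-2-small-style dimensions (12 heads, d_model = 768, d_mlp = 4·d_model = 3072) are my illustrative assumption, not taken from the post.

```python
# Back-of-the-envelope head counting for the one-head-per-hidden-dimension
# construction. The GPT-2-small-style dimensions below (12 heads, d_model=768,
# d_mlp = 4*d_model = 3072) are illustrative assumptions, not from the post.
n_heads_original = 12
d_model = 768
d_mlp = 4 * d_model          # 3072 hidden dimensions in the MLP
heads_needed = d_mlp         # one attention head per hidden dimension
print(heads_needed)                        # 3072, i.e. the "~3,000 heads"
print(heads_needed / n_heads_original)     # 256.0, i.e. the "~250x" inflation
```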
Yes, you’re right. My bad; I was skimming in a hurry before heading out while focused on my own hobbyhorse of ‘how to make MLPs beat Transformers?’. Knew I was missing something, so glad I checked. Now that you put it that way, the intuition is a lot clearer, and shrinking it seems a lot harder: one head per hidden dim/neuron is a straightforward construction, but it’s unclear how much you could be guaranteed to shrink it by trying to merge heads...
The empirical approach, in both directions, might be the best bet here, and has the advantage of being the sort of thing that someone junior could get interesting results on quickly with minimal hardware.
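As a hedged sketch of what that empirical experiment might look like (all dimensions, the prepended learnable auxiliary token, and the use of torch.nn.MultiheadAttention are my own illustrative choices, not the post's construction): distill a frozen MLP block into a single self-attention layer and sweep the head count, to see how far below one-head-per-hidden-dim the fit stays acceptable.

```python
# Sketch: distill a frozen MLP block into a multi-head self-attention layer and
# sweep the head count. All sizes here are illustrative assumptions.
import torch
import torch.nn as nn

d_model, d_mlp, seq_len, batch = 64, 256, 16, 128
device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen "teacher": a standard position-wise MLP block.
teacher = nn.Sequential(
    nn.Linear(d_model, d_mlp), nn.GELU(), nn.Linear(d_mlp, d_model)
).to(device)
for p in teacher.parameters():
    p.requires_grad_(False)

class AttnStudent(nn.Module):
    """Self-attention student with a learnable auxiliary token prepended, so the
    softmax between each position and that token can supply a nonlinearity."""
    def __init__(self, n_heads):
        super().__init__()
        self.aux = nn.Parameter(torch.randn(1, 1, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        aux = self.aux.expand(x.shape[0], 1, d_model)
        ctx = torch.cat([aux, x], dim=1)
        out, _ = self.attn(ctx, ctx, ctx, need_weights=False)
        return out[:, 1:, :]  # drop the auxiliary position

def distill(n_heads, steps=2000, lr=1e-3):
    """Train an attention layer with n_heads to imitate the frozen MLP on random inputs."""
    student = AttnStudent(n_heads).to(device)
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(steps):
        x = torch.randn(batch, seq_len, d_model, device=device)
        loss = nn.functional.mse_loss(student(x), teacher(x))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

# Sweep head counts (nn.MultiheadAttention requires n_heads to divide d_model).
for h in (4, 8, 16, 32, 64):
    print(f"{h:3d} heads -> final MSE {distill(h):.4f}")
```

Note that nn.MultiheadAttention ties head dimension to d_model / n_heads, so this only probes head counts up to d_model; it’s meant as a cheap first experiment someone could run on minimal hardware, not a faithful reimplementation of the construction.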