The Bitter Lesson applies to almost all attempts to build additional structure into neural networks, it turns out.
Out of curiosity, what are the other exceptions to this besides the obvious one of attention?
Off the top of my head: residual (skip) connections, improved positional embeddings/encodings, and layer norm.
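For concreteness, here's a minimal PyTorch-style sketch (my own toy example, not from the thread; module sizes and names are illustrative) showing all three of those hand-built structures in one small block: a learned positional embedding, a layer norm, and a residual connection around a feed-forward layer.

```python
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """Toy block illustrating the three structures listed above:
    learned positional embeddings, layer norm, and a residual connection."""
    def __init__(self, d_model=64, max_len=128):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)  # positional embedding
        self.norm = nn.LayerNorm(d_model)              # layer norm
        self.ff = nn.Linear(d_model, d_model)          # stand-in for the sublayer

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        positions = torch.arange(x.size(1), device=x.device)
        h = x + self.pos_emb(positions)                # inject position information
        return h + self.ff(self.norm(h))               # pre-norm residual connection

x = torch.randn(2, 16, 64)
print(TinyBlock()(x).shape)  # torch.Size([2, 16, 64])
```

The point of each piece is that it's structure added by hand rather than learned from data, yet (so far) it has survived scaling rather than being washed out by it.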