From the “Conclusion and Future Directions” section of the colab notebook:
Most of all, we cannot handwave away LayerNorm as “just doing normalization”; this would be analogous to describing ReLU as “just making things nonnegative”.
I don’t think we know too much about what exactly LayerNorm is doing in full-scale models, but at least in smaller models, I believe we’ve found evidence of transformers using LayerNorm to do nontrivial computations[1].
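To illustrate the point that LayerNorm is not "just doing normalization", here is a minimal sketch (using a bare normalization with no learned scale or bias, which is my simplification) showing that LayerNorm is a genuinely nonlinear operation, in the same sense that ReLU is: it does not distribute over addition, so a network can in principle use it for nontrivial computation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize to zero mean and unit variance along the last axis.
    # (Real LayerNorm also applies a learned scale and bias; omitted
    # here for simplicity.)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([4.0, 1.0, 0.0, -2.0])

# LayerNorm is not linear: LN(a + b) != LN(a) + LN(b).
# So, like ReLU, it cannot be folded into the surrounding linear maps.
print(layer_norm(a + b))
print(layer_norm(a) + layer_norm(b))
```

Geometrically, this normalization projects each residual-stream vector onto (a scaled copy of) a sphere, which is exactly the kind of direction-dependent operation a model could exploit for computation rather than mere rescaling.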
I think I vaguely recall something about this in either Neel Nanda’s “Rederiving Positional Encodings” stuff, or Stefan Heimersheim + Kajetan Janiak’s work on 4-layer Attn-only transformers, but I could totally be misremembering, sorry.