I really liked this post and would like to engage with it more later. It could be very useful!
However, I also think it would be good for you to add a section reviewing previous academic work on this topic (e.g., https://aclanthology.org/2021.emnlp-main.133.pdf). This seems very relevant and may not be the only academic work on the topic (I did not search for long). Curious to hear what you find!
Thanks for the comment and for linking that paper! I think that paper is about training dynamics, though: norm growth as a function of training checkpoint rather than layer index.
Generally, I find basically no papers discussing parameter or residual stream norm growth over layer number; all the similar-sounding papers seem to discuss parameter norms increasing as a function of epoch or checkpoint (training dynamics). I don't expect the scaling over epoch and over layer number to be related?
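To be concrete about the distinction, here is a minimal sketch of the layer-wise measurement I have in mind: residual stream norm as a function of layer index at a single fixed checkpoint, not over training. This assumes the TransformerLens library; the model choice and prompt are just for illustration.

```python
# Minimal sketch: residual stream norm per layer at a fixed checkpoint.
# Assumes TransformerLens is installed; "gpt2" and the prompt are illustrative.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
_, cache = model.run_with_cache("The quick brown fox jumps over the lazy dog")

for layer in range(model.cfg.n_layers):
    resid = cache["resid_pre", layer]             # shape: (batch, seq, d_model)
    mean_norm = resid.norm(dim=-1).mean().item()  # average L2 norm over positions
    print(f"layer {layer:2d}: mean residual stream norm = {mean_norm:.2f}")
```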
Only this paper mentions layer number in this context, and that paper is about solving the vanishing-gradient problem in Post-LN transformers. I don't think that problem applies to the Pre-LN architecture? (See Zach Furman's comment for that discussion.)
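For reference, here is a schematic sketch of the Post-LN vs Pre-LN difference I'm referring to (illustrative PyTorch, not code from either paper):

```python
# Schematic contrast of the two block orderings discussed above.
import torch.nn as nn

class PostLNBlock(nn.Module):
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer        # attention or MLP sublayer
        self.ln = nn.LayerNorm(d_model)

    def forward(self, x):
        # LayerNorm sits on the residual path, after the addition.
        return self.ln(x + self.sublayer(x))

class PreLNBlock(nn.Module):
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.ln = nn.LayerNorm(d_model)

    def forward(self, x):
        # The skip connection is an identity path; LayerNorm is only inside the branch.
        return x + self.sublayer(self.ln(x))
```

In the Pre-LN form the skip connection is an unnormalized identity path, so sublayer outputs accumulate directly in the residual stream, whereas in the Post-LN form every layer's output is renormalized.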
Thanks for the feedback. After rereading this post and the paper I linked, and having read the paper you linked, my thoughts have developed significantly. A few points I'll make here before writing a separate comment:
- The paper I linked originally does indeed focus on training dynamics, but it may have relevant general concepts in its discussion of the relationship between saturation and expressivity. However, it focuses on the QK circuit, which is less relevant here.
- My gut feeling is that true explanations of related formulas should have non-trivial relationships: if you had a good explanation for why parameter norms grow during training, it should relate to why parameter norms differ across the model. However, this is a high-level argument, and the content of your post does of course directly address a different phenomenon (residual stream norms). If that paper had studied the training dynamics of the residual stream norm, I think it would be very relevant.