Yeah I think we have the same understanding here (in hindsight I should have made this more explicit in the post / title).
I would be excited to see someone empirically try to answer the question you mention at the end. In particular, given some direction u and a LayerNormed vector v, one might try to quantify how smoothly rotating from v towards u changes the output of the MLP layer. This seems like a good test of whether the Polytope Lens is helpful / necessary for understanding the MLPs of Transformers (smooth changes would correspond to your 'random jostling cancels out' picture, i.e. to not needing to worry about Polytope Lens style issues).
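To make the proposed experiment concrete, here is a rough sketch of what I have in mind (all specifics are my own choices, not from the post: a randomly initialized toy ReLU MLP stands in for a real Transformer MLP, and I rotate along the sphere via spherical interpolation):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
# Toy stand-in for a Transformer MLP layer: d -> 4d -> d with ReLU.
W_in = rng.normal(size=(4 * d, d)) / np.sqrt(d)
W_out = rng.normal(size=(d, 4 * d)) / np.sqrt(4 * d)

def mlp(x):
    return W_out @ np.maximum(W_in @ x, 0.0)

def slerp(v, u, t):
    # Spherical interpolation between unit vectors v and u.
    theta = np.arccos(np.clip(v @ u, -1.0, 1.0))
    return (np.sin((1 - t) * theta) * v + np.sin(t * theta) * u) / np.sin(theta)

# v plays the role of the LayerNormed vector, u the target direction.
v = rng.normal(size=d); v /= np.linalg.norm(v)
u = rng.normal(size=d); u /= np.linalg.norm(u)

ts = np.linspace(0.0, 1.0, 101)
outputs = np.stack([mlp(slerp(v, u, t)) for t in ts])
# One crude smoothness measure: sizes of successive output differences
# along the path. Spikes here would hint at Polytope-Lens-style boundary
# crossings mattering; a flat profile supports the 'jostling cancels' view.
step_sizes = np.linalg.norm(np.diff(outputs, axis=0), axis=1)
```

Obviously a real version would use an actual trained model's MLP and meaningful directions u, but the measurement itself could look something like this.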
Also: it seems like there would be an easier way to arrive at the observation this post makes, i.e. directly showing that kV and V get mapped to the same point by layer norm (excluding the epsilon).
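Just to spell that out, the scale invariance can be checked in a couple of lines (a minimal sketch, assuming LayerNorm without learned scale/bias, eps set to 0, and positive k):

```python
import numpy as np

def layer_norm(x, eps=0.0):
    # Standard LayerNorm over the last axis, no affine parameters.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
V = rng.normal(size=16)
# For any k > 0, both the mean and the std scale by k, so kV and V
# land on the same point after layer norm (exactly, when eps = 0).
for k in [0.5, 2.0, 100.0]:
    assert np.allclose(layer_norm(V), layer_norm(k * V))
```

With a nonzero eps the two points differ slightly, which is why the post's observation is only approximate for small norms.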
Don’t get me wrong, the circle is cool, but it seems like a bit of a detour.