The helix is already pretty long, so maybe layernorm is responsible?
E.g. to do position-independent look-back we want the geometry of the embedding to be invariant to some euclidean embedding of the 1D translation group. If you have enough space handy it makes sense for this to be a line. But if you only have a bounded region to work with, and you want to keep the individual position embeddings a certain distance apart, you are forced to “curl” the line up into a more complex representation (screw transformations) because you need the position-embedding curve to simultaneously have high length while staying close to the origin.
Actually, layernorms may directly ruin the linear case by projecting it away, so you actually want an approximate group-symmetry that lives on the sphere. In this picture the natural shape for shorter lengths is a circle, and for longer lengths we are forced to stretch it into a separate dimension if we aren’t willing to make the circle arbitrarily dense.
The helix is already pretty long, so maybe layernorm is responsible?
E.g. to do position-independent look-back we want the geometry of the embedding to be invariant to some euclidean embedding of the 1D translation group. If you have enough space handy it makes sense for this to be a line. But if you only have a bounded region to work with, and you want to keep the individual position embeddings a certain distance apart, you are forced to “curl” the line up into a more complex representation (screw transformations) because you need the position-embedding curve to simultaneously have high length while staying close to the origin.
Actually, layernorms may directly ruin the linear case by projecting it away, so you actually want an approximate group-symmetry that lives on the sphere. In this picture the natural shape for shorter lengths is a circle, and for longer lengths we are forced to stretch it into a separate dimension if we aren’t willing to make the circle arbitrarily dense.