Very cool! I believe this structure allows expressing the “look back N tokens” operation (perhaps even for different Ns across different heads) via a position-independent rotation (and translation?) of the positional subspace of query/key vectors. This sort of operation is useful if many patterns in the dataset depend on the relative arrangement of tokens (e.g. common n-grams) rather than their absolute positions. Since all these models use absolute positional embeddings, the positional embeddings have to contort themselves to make this happen.
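To make the “position-independent rotation (and translation?)” concrete, here is a minimal sketch. It assumes an idealized helix p(t) = (cos ωt, sin ωt, c·t), not the learned embeddings, and the names `helix`, `look_back`, `OMEGA` and `PITCH` are made up for illustration. On such a helix a single screw motion (rotate by -ωN, shift by -cN) sends position t to position t-N for every t:

```python
import math

OMEGA = 0.3   # assumed angular step per token (illustrative value)
PITCH = 0.05  # assumed drift along the helix axis per token (illustrative value)

def helix(t):
    """Idealized positional embedding for position t: a point on a helix."""
    return (math.cos(OMEGA * t), math.sin(OMEGA * t), PITCH * t)

def look_back(p, n):
    """Screw motion: rotate by -OMEGA*n in the xy-plane, shift z by -PITCH*n.

    Nothing in here depends on which position p came from.
    """
    x, y, z = p
    c, s = math.cos(-OMEGA * n), math.sin(-OMEGA * n)
    return (c * x - s * y, s * x + c * y, z - PITCH * n)

def max_error(n, positions):
    """Largest coordinate gap between look_back(helix(t), n) and helix(t - n)."""
    return max(
        abs(a - b)
        for t in positions
        for a, b in zip(look_back(helix(t), n), helix(t - n))
    )

if __name__ == "__main__":
    for n in (1, 3, 7):
        # Should be at floating-point noise level for every n.
        print(n, max_error(n, range(10, 200)))
```

The same `look_back` works at every position, which is what would let a head implement “N tokens back” without knowing where it is in the context. Whether the trained embeddings are close enough to this ideal is a separate question.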
Oh, interesting! Can you explain why the “look back N tokens” operation would have been less easily expressible if all the points had been on a single line? I’m not sure I understand yet the advantage of a helix over a straight line.
Is there any sort of regularization in the training process, favouring parameters that aren’t particularly large in magnitude? I suspect that even a very shallow gradient toward parameters with smaller absolute magnitude would favour more compact representations that retain symmetries.
Good question. I don’t have a tight first-principles answer. The helix puts a bit of positional information in the variable magnitude (otherwise it’d be an ellipse, which would alias different positions) and a bit in the variable rotation, whereas the straight line is the far extreme of putting all of it in the magnitude. My intuition is that (in a transformer, at least) encoding information through the norm of vectors and acting on it through translations is “harder” than encoding information through (almost-)orthogonal subspaces and acting on it through rotations.
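A toy illustration of that intuition, not a first-principles argument: if positions sit on a line, a plain dot-product score between any query and the keys is linear in the key position, so it is always largest at one end of the context and can’t single out “N tokens back”. On a circle, a rotated query scores highest at exactly that position. Everything below (`line_scores`, `circle_scores`, `OMEGA`, the context length of 40) is hypothetical, and it ignores the learned projections, causal masking and layernorm that a real head has.

```python
import math

OMEGA = 0.05  # assumed angle per token; 40 * OMEGA < 2*pi, so no wrap-around here

def line_scores(query_pos, n_back, context):
    """Dot-product scores when position t is embedded as the 1-D point t."""
    q = query_pos - n_back  # query translated back by n_back
    return [q * k for k in range(context)]

def circle_scores(query_pos, n_back, context):
    """Dot-product scores when position t is embedded as (cos wt, sin wt)."""
    a = OMEGA * (query_pos - n_back)  # query rotated back by n_back steps
    q = (math.cos(a), math.sin(a))
    return [q[0] * math.cos(OMEGA * k) + q[1] * math.sin(OMEGA * k)
            for k in range(context)]

def argmax(xs):
    """Index of the largest score (first one on ties)."""
    return max(range(len(xs)), key=xs.__getitem__)

if __name__ == "__main__":
    for t in (10, 20, 30):
        print(t, argmax(line_scores(t, 3, 40)), argmax(circle_scores(t, 3, 40)))
```

In this toy setup the line version always attends to the last key, whatever the query position, while the circle version attends to `query_pos - 3`.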
Relevant comment from Neel Nanda: https://twitter.com/NeelNanda5/status/1671094151633305602
The helix is already pretty long, so maybe layernorm is responsible?
E.g. to do position-independent look-back we want the geometry of the embedding to be invariant to some Euclidean embedding of the 1D translation group. If you have enough space handy, it makes sense for this to be a line. But if you only have a bounded region to work with, and you want to keep the individual position embeddings a certain distance apart, you are forced to “curl” the line up into a more complex representation (screw transformations), because the position-embedding curve needs to be long while staying close to the origin.
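A back-of-the-envelope version of the “bounded region” point, with made-up numbers: for n positions kept roughly d apart, compare how far from the origin the curve has to reach if it is a centred line, a single circle, or a helix with several turns. The helper names are hypothetical, and the helix formula is an approximation (it treats arc length per turn as the circumference and spaces successive turns d apart along the axis).

```python
import math

def line_radius(n, d):
    """Farthest point from the origin for n points spaced d apart on a centred line."""
    return (n - 1) * d / 2

def circle_radius(n, d):
    """Radius of a circle on which n evenly spaced points have neighbour distance d."""
    return d / (2 * math.sin(math.pi / n))

def helix_radius(n, d, turns):
    """Approximate farthest point from the origin for a centred helix.

    Assumes about d of arc length per token and successive turns d apart
    along the axis; arc length per turn is approximated by the circumference.
    """
    r = n * d / (2 * math.pi * turns)  # circumference per turn = (n / turns) * d
    half_height = turns * d / 2
    return math.hypot(r, half_height)

if __name__ == "__main__":
    n, d = 1024, 1.0  # illustrative context length and spacing
    print(line_radius(n, d), circle_radius(n, d), helix_radius(n, d, turns=18))
```

With these numbers the line has to reach much farther out than the circle, and the circle farther than the coiled-up helix, which is the sense in which curling up buys length without leaving the bounded region.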
Actually, layernorms may directly ruin the linear case by projecting it away, so you want an approximate group symmetry that lives on the sphere. In this picture the natural shape for shorter lengths is a circle, and for longer lengths we are forced to stretch it into a separate dimension if we aren’t willing to make the circle arbitrarily dense.
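A small sketch of the “projecting it away” point, under simplifying assumptions: layernorm here has no learned scale or bias and eps = 0, and the line passes through the origin (the extreme case; an offset line would not collapse completely). `DIRECTION`, `line_embedding` and `spread` are made-up names. Every positive position on such a line normalizes to the same vector, so the positional information is gone after layernorm:

```python
import math

def layernorm(v):
    """LayerNorm without learned scale/bias and with eps = 0 (a simplification)."""
    mean = sum(v) / len(v)
    centred = [x - mean for x in v]
    std = math.sqrt(sum(x * x for x in centred) / len(v))
    return [x / std for x in centred]

DIRECTION = [0.5, -1.0, 2.0, 0.25]  # arbitrary made-up direction of the line

def line_embedding(t):
    """Position t embedded as the point t * DIRECTION on a line through the origin."""
    return [t * x for x in DIRECTION]

def spread(positions):
    """Largest coordinate gap between the normalised first position and the others."""
    ref = layernorm(line_embedding(positions[0]))
    return max(
        abs(a - b)
        for t in positions
        for a, b in zip(layernorm(line_embedding(t)), ref)
    )

if __name__ == "__main__":
    # Floating-point noise only: positions 1..500 become indistinguishable.
    print(spread([1, 2, 5, 50, 500]))
```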
A line is just a helix that doesn’t curve. It works the same for any helix; it would be a great coincidence to get a line.