Good question. I don’t have a tight first-principles answer. The helix puts some of the positional information in the varying magnitude (without that it’d be an ellipse, which would alias different positions) and some in the varying rotation, whereas the straight line is the extreme case of putting all of it in the magnitude. My intuition is that, in a transformer at least, encoding information through the norm of vectors and acting on it through translations is “harder” than encoding information through (almost-)orthogonal subspaces and acting on it through rotations.
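To make the geometry concrete, here is a toy sketch of the three shapes being compared. It is only an illustration under assumed parameters: the function names, the `period` of 8, and the `pitch` of 0.1 are all made up for the example and are not taken from any model's learned embeddings.

```python
import math

def circle(pos, period=8):
    """Pure rotation: constant norm, so positions `period` apart collide."""
    theta = 2 * math.pi * pos / period
    return (math.cos(theta), math.sin(theta), 0.0)

def helix(pos, period=8, pitch=0.1):
    """Rotation plus a slowly growing axial component, so the norm varies."""
    theta = 2 * math.pi * pos / period
    return (math.cos(theta), math.sin(theta), pitch * pos)

def line(pos, pitch=0.1):
    """Straight line: all positional information sits in the magnitude."""
    return (0.0, 0.0, pitch * pos)

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def dist(a, b):
    return norm(tuple(x - y for x, y in zip(a, b)))

# Positions 1 and 9 are one full turn apart (hypothetical period of 8).
circle_alias = dist(circle(1), circle(9))  # essentially zero: they alias
helix_alias = dist(helix(1), helix(9))     # separated along the axis

# Norms for the first few positions under each encoding.
helix_norms = [norm(helix(p)) for p in range(4)]
line_norms = [norm(line(p)) for p in range(4)]

print(circle_alias, helix_alias)
print(helix_norms)
print(line_norms)
```

In this toy version the circle cannot tell positions 1 and 9 apart, while the helix separates them only through the axial (magnitude) component. The line drops the rotation entirely, so every position has to be read off from the norm alone.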
Relevant comment from Neel Nanda: https://twitter.com/NeelNanda5/status/1671094151633305602