The helix is already pretty long, so maybe layernorm is responsible?
E.g. to do position-independent look-back we want the geometry of the embedding to be (approximately) invariant under a Euclidean action of the 1D translation group. If you have enough space handy, it makes sense for this to be a line. But if you only have a bounded region to work with, and you want to keep the individual position embeddings a certain distance apart, you are forced to “curl” the line up into a more complex representation (screw transformations), because you need the position-embedding curve to have large arc length while staying close to the origin.
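To make that concrete, here is a toy numpy sketch (with made-up constants, not anything measured from a real model) of a helical position embedding where shifting the token index acts as a screw transformation, i.e. a rigid motion, so the geometry only depends on relative offsets while the curve stays inside a bounded radius:

```python
import numpy as np

# Toy helical position embedding: p(t) = (r*cos(w*t), r*sin(w*t), v*t).
# Shifting the index t -> t + d acts as a screw transformation (rotate by
# w*d about the axis, translate by v*d along it), which is a rigid motion,
# so pairwise distances depend only on the offset t - s.
r, w, v = 1.0, 0.3, 0.02   # made-up constants, purely illustrative

def p(t):
    return np.array([r * np.cos(w * t), r * np.sin(w * t), v * t])

d = 7
for t, s in [(0, 5), (10, 25), (40, 41)]:
    a = np.linalg.norm(p(t) - p(s))
    b = np.linalg.norm(p(t + d) - p(s + d))
    assert np.isclose(a, b)   # geometry is translation-invariant

# And the curve packs a lot of arc length into a bounded radius: every
# position stays within distance r of the axis, while consecutive
# embeddings stay roughly sqrt((r*w)**2 + v**2) apart.
```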
Actually, layernorm may directly ruin the linear case by projecting it away (it subtracts the mean and normalizes away the scale, so a line through the origin gets collapsed to at most two points), so what you really want is an approximate group symmetry that lives on the sphere. In this picture the natural shape for shorter lengths is a circle, and for longer lengths we are forced to stretch it into a separate dimension (back toward a helix) if we aren’t willing to make the circle arbitrarily dense.
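A quick numerical sketch of that point (again just a toy with made-up vectors, not claiming this is what any trained model’s layernorm actually sees): a purely linear position direction collapses under layernorm, while a circle built from zero-mean orthonormal directions already lives on the sphere layernorm projects onto, so it survives.

```python
import numpy as np

def layernorm(x, eps=1e-6):
    # layernorm without learned scale/bias: subtract mean, divide by std
    return (x - x.mean()) / (x.std() + eps)

# 1) A purely linear embedding p(t) = t * v collapses: every positive t
#    maps to the same point after normalization.
v = np.array([0.3, -0.1, 0.7, 0.5])          # arbitrary direction
line = np.stack([layernorm(t * v) for t in range(1, 6)])
print(np.ptp(line, axis=0).max())            # ~0: position info is gone

# 2) A circle built from two zero-mean orthonormal directions is already
#    on the sphere layernorm projects onto, so positions stay distinct.
u1 = np.array([1.0, -1.0, 0.0, 0.0]) / np.sqrt(2)
u2 = np.array([0.0, 0.0, 1.0, -1.0]) / np.sqrt(2)
w = 0.3
circle = [layernorm(np.cos(w * t) * u1 + np.sin(w * t) * u2)
          for t in range(1, 6)]
steps = [np.linalg.norm(a - b) for a, b in zip(circle, circle[1:])]
print(steps)                                  # equal, nonzero steps
```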
People here describe themselves as “pessimistic” about a variety of aspects of AI risk on a very regular basis, so this seems like an isolated demand for rigor.
This seems like a weird bait and switch to me, where an object-level argument is only ever allowed to arrive at a neutral middle-ground conclusion. A “neutral, balanced view of possibilities” is absolutely allowed to end on a strong conclusion without a forest of caveats. You switch your reading of “optimism” partway through this paragraph, in a way that seems inconsistent with your earlier comment and that smuggles in the conclusion “any purely factual argument will express a wide range of concerns and uncertainties, or else it is biased”.