I think I remember William Merrill (in a video) pointing out that the rational inputs assumption seems very unrealistic (would require infinite memory); and, from what I remember, https://arxiv.org/abs/2404.15758 and related papers made a different assumption about the number of bits of memory per parameter and per input.
Everything Turing-complete requires infinite memory. When we are saying “x86 set of instructions is Turing-complete” we imply “assuming that processor operates on infinite memory”. It’s in definition of Turing machine to include infinite tape, after all.
It’s hard to pinpoint, but the trick is that it’s very nuanced difference between the sense in which transformers are limited in complexity-theoretic sense and “transformers can’t do X”. Like, there is nothing preventing transformers from playing chess perfectly—they just need to be sufficiently large for this. To answer the question “can transformers do X” you need to ask “how much computing power transformer has” and “can this computing power be shaped by SGD into solution”.
I think I remember William Merrill (in a video) pointing out that the rational inputs assumption seems very unrealistic (would require infinite memory); and, from what I remember, https://arxiv.org/abs/2404.15758 and related papers made a different assumption about the number of bits of memory per parameter and per input.
Also, TC0 is very much limited, see e.g. this presentation.
Everything Turing-complete requires infinite memory. When we are saying “x86 set of instructions is Turing-complete” we imply “assuming that processor operates on infinite memory”. It’s in definition of Turing machine to include infinite tape, after all.
It’s hard to pinpoint, but the trick is that it’s very nuanced difference between the sense in which transformers are limited in complexity-theoretic sense and “transformers can’t do X”. Like, there is nothing preventing transformers from playing chess perfectly—they just need to be sufficiently large for this. To answer the question “can transformers do X” you need to ask “how much computing power transformer has” and “can this computing power be shaped by SGD into solution”.