I looked over it and I should note that “transformers are in TC0” is not very useful statement for prediction of capabilities. Transformers are Turing-complete given rational inputs (see original paper) and them being in TC0 basically means they can implement whatever computation you can implement using boolean circuit for fixed amount of available compute which amounts to “whatever computation is practical to implement”.
I think I remember William Merrill (in a video) pointing out that the rational inputs assumption seems very unrealistic (would require infinite memory); and, from what I remember, https://arxiv.org/abs/2404.15758 and related papers made a different assumption about the number of bits of memory per parameter and per input.
Everything Turing-complete requires infinite memory. When we are saying “x86 set of instructions is Turing-complete” we imply “assuming that processor operates on infinite memory”. It’s in definition of Turing machine to include infinite tape, after all.
It’s hard to pinpoint, but the trick is that it’s very nuanced difference between the sense in which transformers are limited in complexity-theoretic sense and “transformers can’t do X”. Like, there is nothing preventing transformers from playing chess perfectly—they just need to be sufficiently large for this. To answer the question “can transformers do X” you need to ask “how much computing power transformer has” and “can this computing power be shaped by SGD into solution”.
I looked over it and I should note that “transformers are in TC0” is not very useful statement for prediction of capabilities. Transformers are Turing-complete given rational inputs (see original paper) and them being in TC0 basically means they can implement whatever computation you can implement using boolean circuit for fixed amount of available compute which amounts to “whatever computation is practical to implement”.
I think I remember William Merrill (in a video) pointing out that the rational inputs assumption seems very unrealistic (would require infinite memory); and, from what I remember, https://arxiv.org/abs/2404.15758 and related papers made a different assumption about the number of bits of memory per parameter and per input.
Also, TC0 is very much limited, see e.g. this presentation.
Everything Turing-complete requires infinite memory. When we are saying “x86 set of instructions is Turing-complete” we imply “assuming that processor operates on infinite memory”. It’s in definition of Turing machine to include infinite tape, after all.
It’s hard to pinpoint, but the trick is that it’s very nuanced difference between the sense in which transformers are limited in complexity-theoretic sense and “transformers can’t do X”. Like, there is nothing preventing transformers from playing chess perfectly—they just need to be sufficiently large for this. To answer the question “can transformers do X” you need to ask “how much computing power transformer has” and “can this computing power be shaped by SGD into solution”.