Tl;Dr: LLMs perform O(1) computational steps per generated token, and this is true regardless of which token is generated.
The LLM attends to every token in its context window when generating the next token, so over a full window it can solve problems requiring up to O(n^2) computational steps [where n is the context window size].
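To make those two claims concrete, here is a toy FLOP count (the dimensions, layer count, and cost terms are made-up illustrative values, not any real model): for a fixed transformer, the work done per token is bounded by a constant, while the attention term summed over an n-token window grows as O(n^2).

```python
# Toy FLOP count for a fixed decoder-only transformer generating n tokens.
# D_MODEL and N_LAYERS are illustrative made-up values, not a real model.

D_MODEL = 4096
N_LAYERS = 32

def flops_for_token(position: int) -> int:
    """Approximate FLOPs to generate the token at `position` (0-indexed)."""
    fixed = N_LAYERS * 12 * D_MODEL**2             # projections + MLP: independent of position
    attention = N_LAYERS * 2 * position * D_MODEL  # attend over the prefix: grows linearly
    return fixed + attention

def total_flops(n: int) -> int:
    """Total over a window of n tokens: the attention term sums to O(n^2)."""
    return sum(flops_for_token(t) for t in range(n))

# Per-token cost is bounded by the cost at the end of the window (O(1) for a
# fixed model), while the attention term dominates the window total as n grows.
for n in (1_000, 4_000, 16_000):
    print(n, f"{total_flops(n):.2e}")
```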
LLMs can get around these computational limits by "showing their working": simulating a mechanical computer (one without backtracking, so not Turing-complete) in their context window.
This only works if the context window is large enough to contain the workings for the entire algorithm.
Thus LLMs can perform matrix multiplication when showing their working, but not when asked to produce the answer directly.
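A rough token-budget calculation for the matrix-multiplication example (the tokens-per-step constant and window size are assumptions for illustration): naive multiplication of two N x N matrices takes ~N^3 scalar multiply-adds, each of which costs tokens to write out as working, so the workings only fit in the window for small N.

```python
# Token budget for "showing working" on naive N x N matrix multiplication.
# TOKENS_PER_STEP and CONTEXT_WINDOW are illustrative assumptions.

TOKENS_PER_STEP = 8       # e.g. writing out one "c[i][j] += a[i][k]*b[k][j]"
CONTEXT_WINDOW = 128_000  # assumed window size in tokens

def working_tokens(n: int) -> int:
    """Naive matmul performs n**3 multiply-adds, each transcribed as working."""
    return n**3 * TOKENS_PER_STEP

for n in (8, 16, 32, 128):
    needed = working_tokens(n)
    print(f"N={n}: {needed:>12,} tokens, fits={needed <= CONTEXT_WINDOW}")
```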
This is an important, fundamental limitation of the current paradigm.
We can now name with certainty tasks that GPT will never be able to solve, no matter how far it's scaled up (e.g. beating Stockfish at chess, because chess is combinatorial and the LLM can't search the game tree to any depth).
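To see why the chess claim follows, count the tokens a written-out game-tree search would need (the branching factor and tokens-per-node are rough assumed values): the transcript grows exponentially with depth and outruns any plausible context window within a few plies.

```python
# Tokens required to "show the working" of a full-width game-tree search.
# BRANCHING and TOKENS_PER_NODE are rough assumptions for illustration.

BRANCHING = 35        # commonly cited average number of legal chess moves
TOKENS_PER_NODE = 20  # assumed cost to transcribe one searched position

def transcript_tokens(depth: int) -> int:
    """Nodes in a full-width search to `depth` plies, times tokens per node."""
    nodes = sum(BRANCHING**d for d in range(1, depth + 1))
    return nodes * TOKENS_PER_NODE

for depth in (2, 4, 6, 8):
    print(f"depth {depth}: {transcript_tokens(depth):.2e} tokens")
# Even depth 6 needs ~4e10 tokens; Stockfish searches far deeper than that.
```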
The solution is IMO just to treat the number of computations performed per generated token as some function of the model size; once we've identified a suitable asymptotic order for that function, we can say intelligent things like "the smallest network capable of solving a problem in complexity class C of size N is X" (sketched in code below).
Or if our asymptotic bounds are not tight enough:
“No economically feasible LLM can solve problems in complexity class C of size >= N”.
(Where economically feasible may be something defined by aggregate global economic resources or similar, depending on how tight you want the bound to be.)
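Here is one way to operationalise that proposal; every ingredient is an assumption (a linear per-token cost model f(P) ~ 2P FLOPs and a fixed window of L tokens), not a measured value. Total compute over one transcript is at most L * f(P), so any instance needing more steps is unsolvable, and inverting the bound gives a floor on model size.

```python
# Sketch of the proposed impossibility bound. The cost model f(P) ~ 2P
# FLOPs/token and the window length are assumptions, not measured values.

def f(params: float) -> float:
    """Assumed per-token compute, linear in parameter count."""
    return 2.0 * params

def max_steps(params: float, window: int) -> float:
    """Upper bound on useful computation over one full transcript."""
    return window * f(params)

def smallest_params(required_steps: float, window: int) -> float:
    """Invert the bound: smallest P with window * f(P) >= required_steps."""
    return required_steps / (2.0 * window)

# A problem instance needing 1e20 sequential steps, with a 1e6-token window,
# needs at least ~5e13 parameters under this cost model:
print(f"{smallest_params(1e20, 1_000_000):.1e}")
```

Plugging an estimate of aggregate global compute into the same inequality, instead of a single window's budget, would give the "economically feasible" version of the bound.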
Regardless, we can still obtain meaningful impossibility results.
Strongly upvoted.
Short but powerful.
This is a very powerful argument.
Very big caveat: the LLM doesn’t actually perform O(1) computations per generated token.
The number of computational steps performed per generated token scales with network size: https://www.lesswrong.com/posts/XNBZPbxyYhmoqD87F/llms-and-computation-complexity?commentId=QWEwFcMLFQ678y5Jp
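A back-of-the-envelope version of the linked point (the ~2 FLOPs per parameter per token figure is a standard rough approximation, not from the post): per-token compute grows linearly with parameter count, though for any fixed model it is still the same for every token.

```python
# Rough per-token compute as a function of model size, using the common
# ~2 FLOPs/parameter/token approximation (an assumption, not exact).

def flops_per_token(n_params: float) -> float:
    return 2.0 * n_params

for n_params in (1e9, 1e10, 1e11, 1e12):
    print(f"{n_params:.0e} params -> {flops_per_token(n_params):.0e} FLOPs/token")
```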
Caveat to the caveat: for any fixed, already-trained model, the number of computations per generated token is still a constant, so the impossibility argument goes through once you fix the model.