You’re correct that it arises because we can choose a large enough constant (proportional to parameter count, which is a constant with respect to inference), and because we have bounded context windows. Not all large language models must be constant time, nor are they.
The concerning part is that all the big name ones I’m aware of are running in constant time (per token) and still manage to do extremely well. Every time we see some form of intelligence expressed within a single token prediction on these models, we get a proof that that kind of intelligence is just not very complicated.
I just don’t intuitively follow. It violates my intuitions about algorithms and complexity.
Does this generalise? Would it also be constant time per token if it was generating outputs a million tokens long?
Does the time per token vary with the difficulty of the prediction task? Not all prediction tasks should be equally difficult, so if cost doesn’t vary, that also warrants explanation.
I just don’t buy the constant time hypothesis/formulation. It’s like: “if you’re getting that result, you’re doing something illegitimate or abusing the notion of complexity”.
Constant time per token, generalising asymptotically, becomes linear complexity, and there are problems that we know are worse than linear complexity. It’s like this result just isn’t plausible?
Does this generalise? Would it also be constant time per token if it was generating outputs a million tokens long?
Yes, if you modified the forward pass to output a million tokens at once, it would remain constant time so long as the forward pass’s execution remained bounded by a constant. Likewise, you could change the output distribution to cover tokens of extreme length. Realistically, the architecture wouldn’t be practical: either it would be enormous and slow, or its predictions would suck.
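To make that concrete, here’s a toy sketch of the “emit a big block at once” variant (shapes are made up, not any real model): the output head just gets wider, and the work per pass is still a constant fixed by the architecture rather than by the input.

```python
# Toy sketch: predicting K tokens in one forward pass just means a bigger
# output head. The compute is still a constant chosen up front; it does not
# depend on what the prompt asks for.
import torch
import torch.nn as nn

d_model, vocab, K = 512, 50_000, 8   # K = tokens emitted per pass (assumption)
final_hidden = torch.randn(1, d_model)            # stand-in for the last hidden state
multi_head = nn.Linear(d_model, vocab * K)        # one logit block per output position
logits = multi_head(final_hidden).view(K, vocab)  # K next-token distributions at once
# Push K toward a million and the head is enormous and slow -- or, if you
# shrink the rest of the model to compensate, the predictions get bad.
print(sum(p.numel() for p in multi_head.parameters()))
```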
Does the time per token vary with the difficulty of the prediction task? Not all prediction tasks should be equally difficult, so if cost doesn’t vary, that also warrants explanation.
No, a given GPT-like transformer always does exactly the same thing in the forward pass. GPT-3 does not have any kind of adaptive computation time within a forward pass. If a single token prediction requires more computation steps than fit in the (large) constant time available to the forward pass, the transformer cannot fully complete the prediction. This is near the core of the “wait what” response I had to GPT-3's performance.
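To illustrate what “exactly the same thing” means, here’s a minimal sketch (toy shapes, nothing like a real implementation): every prompt, easy or hard, triggers the identical fixed stack of operations, and there is no branch that says “this prediction is hard, loop a few more times.”

```python
# Minimal sketch of a GPT-like layer stack with no data-dependent control flow.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d, heads=8):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # one attention per layer, always
        return x + self.mlp(self.ln2(x))                    # one MLP per layer, always

d, n_layers, ctx = 768, 12, 64
model = nn.Sequential(*[Block(d) for _ in range(n_layers)])
easy = torch.randn(1, ctx, d)  # stand-in embeddings for "2 + 2 = "
hard = torch.randn(1, ctx, d)  # stand-in embeddings for a huge multiplication prompt
_ = model(easy)
_ = model(hard)  # exactly the same sequence of operations either way
```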
Note that when you prompt GPT-3 with something like “1401749 x 23170802 = ”, it will tend to give you a prediction which matches the shape of the answer (it’s a number, and a fairly big one), but beyond the rough approximation, it’s pretty much always going to be wrong. Even if you fine-tuned GPT-3 on arithmetic, you would still be able to find two integers of sufficient size that they cannot be multiplied in one step, because the number of internal steps required exceeds the number of steps the forward pass can express.
The output distribution will cover a wide smear of tokens corresponding to approximately-correct big numbers. It can’t compute which one is right, so the probability distribution can’t narrow any further.
(Raw GPT-3 also isn’t interested in being correct except to the extent that being correct corresponds to a good token prediction, so it won’t bother with trying to output intermediate tokens that could let it perform a longer non-constant-time computation. The prompt makes it look like the next token should be an answer, not incremental reasoning, so it’ll sample from its smear of answer-shaped tokens.)
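For a back-of-envelope sense of why multiplication eventually breaks: schoolbook multiplication of two d-digit numbers takes on the order of d² single-digit steps, while the forward pass’s step budget is fixed. The budget number below is purely hypothetical, just to show the crossover exists.

```python
# Schoolbook multiplication of two d-digit numbers needs ~d**2 single-digit
# steps; a fixed forward pass has a fixed budget of "useful" serial steps.
def schoolbook_steps(d: int) -> int:
    return d * d  # one partial product per digit pair, ignoring carries

PASS_BUDGET = 10_000  # hypothetical per-forward-pass step budget, for illustration
for d in (7, 8, 50, 500):
    fits = schoolbook_steps(d) <= PASS_BUDGET
    print(d, schoolbook_steps(d), "fits" if fits else "does not fit")
# Writing out the long multiplication as intermediate tokens sidesteps this:
# ~d**2 tokens of scratch work buys ~d**2 forward passes of compute.
```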
“if you’re getting that result, you’re doing something illegitimate or abusing the notion of complexity”
It can feel that way a bit due to the scale (like it’s technically true but meaningless, maybe), but it is directly visible in their behavior, and it has major implications for how these things could advance. The constants we’re talking about here aren’t actually so pathological that they make the analysis pointless; a quadrillion flops isn’t that many compared to an H100's throughput.
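Rough arithmetic for that last claim (the throughput figure is an approximation of H100 peak dense BF16 tensor throughput from published specs; real utilization is a good deal lower):

```python
# Order-of-magnitude sanity check only.
forward_pass_flops = 1e15        # "a quadrillion flops"
h100_peak_flops_per_s = 1e15     # ~peak dense BF16 tensor throughput (approximate)
print(forward_pass_flops / h100_peak_flops_per_s, "GPU-seconds at peak")  # ~1.0
```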
It’s like this result just isn’t plausible?
If I had to guess, that feeling is probably arising from how I’m using the word “intelligence” in the context of a single forward pass, while it sounds like you’re using the word to describe something more like… the capability of the model over a full simulation that takes many forward passes to complete. The latter is not bound by constant time even in the LLMs I’m talking about, but I think the capabilities exhibited in constant time by current LLMs are sufficient to think that huge chunks of “whatever is the thing we care about when we talk about intelligence” are actually pretty darn easy, computationally.
For all practical purposes, it takes O(N+M) compute to generate N tokens from an M-token context (attention is superlinear, but it takes up a negligible proportion of flops in current models at the context lengths they are trained for; also, while nobody has succeeded at it yet, linear attention does not seem implausible). No current SOTA model has adaptive compute. There has been some work in this direction (see Universal Transformers), but it doesn’t work well enough for people to use it in practice.
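As a rough sketch of where the O(N+M) comes from, with a GPT-3-scale parameter count as an example and a standard KV cache assumed (constants are approximate):

```python
# Rough FLOP accounting for generating N tokens from an M-token prompt.
P = 175e9                 # parameter count (GPT-3-scale, as an example)
d, n_layers = 12288, 96   # matching hidden size and depth (approximate)

def generation_flops(M: int, N: int) -> float:
    dense = 2 * P * (M + N)  # ~2 FLOPs per parameter per token processed
    # attention scores/values: ~4 * d * (current context length) per layer per new position
    attn = sum(4 * d * n_layers * t for t in range(M, M + N))
    return dense + attn

for M, N in [(2048, 256), (2048, 2048)]:
    total = generation_flops(M, N)
    print(M, N, f"{total:.3e} FLOPs,", f"attention share ~{1 - 2 * P * (M + N) / total:.1%}")
# The dense term is linear in N + M; the attention term is quadratic but tiny
# at these context lengths, which is the "for all practical purposes" caveat.
```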
For all practical purposes, it takes O(N+M) compute to generate N tokens from an M-token context
Yup. I suspect that’s close to the root of the confusion/apparent disagreement earlier: when I say constant time, I mean constant with respect to input, given a particular model and bounded context window, for a single token.
I think doing the analysis at this level is often more revealing, in an important way, than doing it across full trajectories or across arbitrary windows: a tight bound makes it easier to make claims about what’s possible by existence proof (which turns out to be a lot).