For all practical purposes, it takes O(N+M) compute to generate N tokens from an M token context (attention is superlinear, but takes up a negligible proportion of flops in current models at the context lengths that current models are trained for. also, while nobody has succeeded at it yet, linear attention does not seem implausible). No current SOTA model has adaptive compute. There has been some work in this direction (see Universal transformers), but it doesn’t work well enough for people to use it in practice.
For all practical purposes, it takes O(N+M) compute to generate N tokens from an M token context
Yup. I suspect that’s close to the root of the confusion/apparent disagreement earlier- when I say constant time, I mean constant with respect to input, given a particular model and bounded context window, for a single token.
I think doing the analysis at this level is often more revealing than doing the analysis across full trajectories or across arbitrary windows in an important way: a tight bound makes it easier to make claims about what’s possible by existence proof (which turns out to be a lot).
For all practical purposes, it takes O(N+M) compute to generate N tokens from an M token context (attention is superlinear, but takes up a negligible proportion of flops in current models at the context lengths that current models are trained for. also, while nobody has succeeded at it yet, linear attention does not seem implausible). No current SOTA model has adaptive compute. There has been some work in this direction (see Universal transformers), but it doesn’t work well enough for people to use it in practice.
Yup. I suspect that’s close to the root of the confusion/apparent disagreement earlier- when I say constant time, I mean constant with respect to input, given a particular model and bounded context window, for a single token.
I think doing the analysis at this level is often more revealing than doing the analysis across full trajectories or across arbitrary windows in an important way: a tight bound makes it easier to make claims about what’s possible by existence proof (which turns out to be a lot).