A bit late to the party. Love the article, but I believe it is somewhat misleading when you say that transformers run in constant time complexity.
If the input size for the time-complexity analysis is the number of tokens in the input sentence, which I’m sure you’ll agree is the obvious choice, then the transformer encoder is run on each token in the sentence, possibly in parallel, but it still has to do all of its computations for each input token, which immediately gives at least O(n) time.
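As a rough back-of-the-envelope sketch (the dimensions below are made-up toy numbers, not anything from the post), the total work in one forward pass grows with the number of input tokens:

```python
# Rough operation count for one transformer forward pass over n tokens.
# The constants are toy values; only how the total scales with n matters here.
def forward_pass_ops(n_tokens: int, d_model: int = 512, n_layers: int = 6) -> int:
    attention = 2 * n_tokens * n_tokens * d_model   # attention scores + weighted sum
    projections = 8 * n_tokens * d_model * d_model  # QKV/output projections + MLP (approx.)
    return n_layers * (attention + projections)

for n in (16, 64, 256):
    print(n, forward_pass_ops(n))
# The count grows with n (super-linearly once attention dominates), so processing
# an n-token input in one pass is at least O(n) total work, even if parallelized.
```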
I do think the point you are trying to make is different, though. Correct me if I’m wrong, but it seems like you are saying that for each token generated, the transformer is only allowed to process for a constant amount of time, and therefore it is impossible for it to do division or other things that require time complexity higher than O(1) without failing at some hard problem. This additionally assumes it is only generating one token.
If that is your argument, I do still find it interesting, as it means that a large portion of intelligence can be simulated in O(1). But all I am trying to say is that this is not actually the time complexity of the transformer; it is the time complexity that the transformer can simulate, assuming it is only generating one token.
If the input size for the time-complexity analysis is the number of tokens in the input sentence, which I’m sure you’ll agree is the obvious choice
Yeah, you’re not alone in thinking that; I think several people have been tripped up by it in the post. Without making it clear, my analysis just assumed that the context window was bounded by some constant, so scaling with respect to token counts went out the window. So:
Correct me if I’m wrong, but it seems like you are saying that for each token generated, the transformer is only allowed to process for a constant amount of time … This additionally assumes it is only generating one token.
Yup.
This is one of the things I’m clarifying for the openphil submission version of the post, along with a section trying to better tie together why it matters. (More than one person has come away thinking something like “but every algorithm bottoms out at individual constant time steps, this isn’t interesting, CoT etc.”)
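To spell out the bounded-context assumption (the numbers below are placeholders, not anything measured): once the context window is capped at a constant, the work available to produce any single token is itself a constant, so anything needing more than O(1) steps has to be spread across generated tokens.

```python
# Sketch of the bounded-context assumption: with the context capped at a
# constant, each generated token gets only a constant compute budget.
MAX_CONTEXT = 4096        # assumed constant bound on the context window
OPS_PER_FORWARD = 10**9   # stand-in constant: cost of one forward pass at MAX_CONTEXT

def generation_cost(n_generated_tokens: int) -> int:
    # Total cost grows only with the number of tokens generated;
    # the per-token budget never exceeds the constant OPS_PER_FORWARD.
    return n_generated_tokens * OPS_PER_FORWARD

# An algorithm that inherently needs ~m sequential steps (e.g. long division on
# an m-digit input) can't fit into a single token's constant budget for
# arbitrarily large m; chain-of-thought spreads those steps over ~m tokens.
```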
Alright. Interested to see the new post. Your content is great btw.
Thanks! Just updated the edited version link to the latest version too. Unfortunately, the main new content is redacted, so it’ll be pretty familiar.