Another, simpler but less illuminating, way to put this is that reasoning that requires more serial depth can't be parallelized.[1]
[1] Transformers do get more computation per token on longer sequences, but they also don't get more serial depth, so I'm not sure if this is actually an issue in practice?
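To make the depth point concrete, here's a minimal Python sketch (toy functions of my own, not from the post): iterating a generic nonlinear map is a length-n dependency chain, whereas an associative operation collapses to logarithmic depth via pairwise reduction.

```python
# Toy illustration: a generic iterated map forms a dependency chain of
# length n, so its serial depth cannot be reduced by adding parallel
# hardware; an associative op can be pairwise-reduced in O(log n) depth.

def iterate_map(f, x, n):
    # Step k needs the output of step k-1: depth n, no matter how
    # many parallel processors are available.
    for _ in range(n):
        x = f(x)
    return x

def tree_reduce(op, xs):
    # For an associative op, combine pairs each round; the number of
    # rounds (the serial depth) is only ceil(log2(n)).
    while len(xs) > 1:
        pairs = [op(xs[i], xs[i + 1]) for i in range(0, len(xs) - 1, 2)]
        if len(xs) % 2:  # odd element carries over to the next round
            pairs.append(xs[-1])
        xs = pairs
    return xs[0]

# The logistic map has no known compact form for its n-fold composition,
# so the 100 steps below are (as far as anyone knows) irreducibly serial:
print(iterate_map(lambda x: 3.7 * x * (1 - x), 0.5, 100))

# Addition is associative, so the same number of operations needs only
# ~log2(100) ≈ 7 serial rounds:
print(tree_reduce(lambda a, b: a + b, list(range(100))))  # 4950
```

Extra parallel width helps the second case but does nothing for the first, which is the distinction the main claim is gesturing at.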
> [C]ompactly represent f∘g (f composed with g) in a way that makes computing f(g(x)) more efficient for general choices of f and g.
As an aside, I actually can't think of any class of interesting functions with this property. When reading the paper, the closest I could come up with were functions on discrete sets (lol), polynomials (but simplifying these is often more expensive than just computing the terms serially), and rational functions (ditto).
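For concreteness, here's what I mean for the discrete-set case, as a minimal Python sketch (the tables are made up for illustration):

```python
# For functions on a finite set, f∘g *can* be represented compactly as a
# single lookup table, so evaluating f(g(x)) costs one lookup instead of two.

# f, g: functions on the discrete set {0, 1, 2, 3}, given as tables.
g = {0: 2, 1: 0, 2: 3, 3: 1}
f = {0: 1, 1: 3, 2: 2, 3: 0}

# Precompute the composite table once (|domain| lookups)...
fg = {x: f[g[x]] for x in g}

# ...after which every evaluation of f(g(x)) is a single table hit:
assert all(fg[x] == f[g[x]] for x in range(4))
print(fg)  # {0: 2, 1: 1, 2: 0, 3: 3}

# Contrast with polynomials: composing degree-m f with degree-n g yields
# degree m*n, so the "compact" composite has ~m*n coefficients and is
# usually more expensive to form and evaluate than just computing g(x)
# and then f(g(x)) serially.
```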
I mean, yeah, as your footnote says:
> Transformers do get more computation per token on longer sequences, but they also don't get more serial depth, so I'm not sure if this is actually an issue in practice?

> As an aside, I actually can't think of any class of interesting functions with this property. When reading the paper, the closest I could come up with were functions on discrete sets (lol), polynomials (but simplifying these is often more expensive than just computing the terms serially), and rational functions (ditto).