A Little Depth Goes a Long Way: the Expressive Power of Log-Depth Transformers

Authors: Anonymous (I’m not one of them).

Abstract:

Most analysis of transformer expressivity treats the depth (number of layers) of a model as a fixed constant, and analyzes the kinds of problems such models can solve across inputs of unbounded length. In practice, however, the context length of a trained transformer model is bounded. Thus, a more pragmatic question is: What kinds of computation can a transformer perform on inputs of bounded length? We formalize this by studying highly uniform transformers where the depth can grow minimally with context length. In this regime, we show that transformers with depth O(log⁡C) can, in fact, compute solutions to two important problems for inputs bounded by some max context length C, namely simulating finite automata, which relates to the ability to track state, and graph connectivity, which underlies multi-step reasoning. Notably, both of these problems have previously been proven to be asymptotically beyond the reach of fixed depth transformers under standard complexity conjectures, yet empirically transformer models can successfully track state and perform multi-hop reasoning on short contexts. Our novel analysis thus explains how transformer models may rely on depth to feasibly solve problems up to bounded context that they cannot solve over long contexts. It makes actionable suggestions for practitioners as to how to minimally scale the depth of a transformer to support reasoning over long contexts, and also argues for dynamically unrolling depth as a more effective way of adding compute compared to increasing model dimension or adding a short chain of thought.