[Linkpost] Towards a Theoretical Understanding of the “Reversal Curse” via Training Dynamics

Link post

The excerpts below seem to me like a slight update towards arguments about the weakness of single forward passes in transformers, and for agendas like externalized reasoning and translucent thoughts. They also suggest out-of-context reasoning (OOCR) might remain hard for transformer-based LMs (they currently do very poorly on OOCR evals, and scaling also doesn't seem to help much):

We theoretically analyze the reversal curse, where training or test sequences have the form “A → B” or “B ← A”, via the training dynamics of (stochastic) gradient descent for two auto-regressive models: a bilinear model (Section 3) and one-layer transformers under certain assumptions similar to Tian et al. (2023a) (Section 4). The analysis of the training dynamics of both models reveals a core reason why the reversal curse happens: the weights of the auto-regressive models are asymmetric, i.e., an increase of the weights from token A to token B during training does not necessarily cause an increase of the weights from B to A. [...] Although the (effective) weights from A to B and from B to A might be related to some extent, since they are both computed from the same set of embeddings, their correlation is weak, and the weights thus remain asymmetric, as verified both theoretically (Sections 3 and 4) and empirically (Section 5).
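
(Not from the paper, but a minimal numerical sketch of the asymmetry claim above: a toy bilinear logit model, logit(b | a) = e_aᵀ W e_b with fixed random embeddings, trained by gradient descent only on the forward pair A → B, after which we compare the effective weights in both directions. The token ids, dimensions, and learning rate are made up for illustration; the paper's actual bilinear and one-layer transformer settings differ.)

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 8, 64
A, B = 0, 1                                   # illustrative token ids

E = rng.normal(scale=0.1, size=(vocab, dim))  # fixed random token embeddings
W = np.zeros((dim, dim))                      # trainable bilinear weights

def logits(W, x):
    """Next-token logits over the whole vocabulary when the input token is x."""
    return E[x] @ W @ E.T                     # shape (vocab,)

def sgd_step(W, x, y, lr=1.0):
    """One gradient-descent step on softmax cross-entropy for the pair x -> y."""
    z = logits(W, x)
    p = np.exp(z - z.max())
    p /= p.sum()
    err = p.copy()
    err[y] -= 1.0                             # dL/dz for softmax cross-entropy
    dW = np.outer(E[x], err @ E)              # dL/dW = e_x (sum_v err_v e_v)^T
    return W - lr * dW

for _ in range(200):                          # train only on the forward pair A -> B
    W = sgd_step(W, A, B)

print("effective weight A -> B:", E[A] @ W @ E[B])  # grows during training
print("effective weight B -> A:", E[B] @ W @ E[A])  # stays comparatively tiny
```

With exactly orthogonal (e.g., one-hot) embeddings the backward weight would stay at exactly zero; the small residual coupling here comes only from the overlap between the random embeddings, which is the weak correlation the excerpt refers to.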
We use the above framework to show the necessity of chain-of-thought (COT) (Wei et al., 2022b) (i.e., a model trained on “A implies B” and “B implies C” separately struggles to directly conclude “A implies C” without COT, which was also empirically observed by Allen-Zhu and Li (2023)) via the training dynamics of one-layer transformers (Section 4.2), which provides a new perspective, different from previous work (Feng et al., 2024) that focuses on the expressivity of transformers. Slightly different from the reason for the reversal curse, in the COT analysis the model weights show intransitivity, i.e., increasing the weights from token A to B and from B to C does not necessarily increase the weights from A to C. We emphasize again that the weights refer to effective weights.
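
(Again not from the paper: the same kind of toy bilinear sketch, now for the intransitivity point. The paper's COT analysis is for one-layer transformers, but the toy shows the same qualitative effect: training on A → B and B → C separately strengthens each direct hop, while the one-step prediction from A remains B rather than C, so C is only reached by explicitly producing the intermediate token B, i.e., a chain-of-thought-style decode. Token ids and hyperparameters are again illustrative.)

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 8, 64
A, B, C = 0, 1, 2                             # illustrative token ids

E = rng.normal(scale=0.1, size=(vocab, dim))  # fixed random token embeddings
W = np.zeros((dim, dim))                      # trainable bilinear weights

def sgd_step(W, x, y, lr=1.0):
    """One gradient-descent step on softmax cross-entropy for the pair x -> y."""
    z = E[x] @ W @ E.T
    p = np.exp(z - z.max())
    p /= p.sum()
    p[y] -= 1.0                               # p now holds dL/dz
    return W - lr * np.outer(E[x], p @ E)

for _ in range(200):                          # the two facts only ever appear separately
    W = sgd_step(W, A, B)
    W = sgd_step(W, B, C)

def predict(x):
    """Greedy next-token prediction for input token x."""
    return int(np.argmax(E[x] @ W @ E.T))

print("one hop from A :", predict(A))           # B  (trained directly)
print("one hop from B :", predict(B))           # C  (trained directly)
print("two hops from A:", predict(predict(A)))  # C, but only via the intermediate token
print("weight A -> C  :", E[A] @ W @ E[C])      # much weaker than the two trained hops
```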
We also empirically validate our theoretical results through the training dynamics of multi-layer transformers (Section 5).
The asymmetry and intransitivity of the weights of auto-regressive models indicate that auto-regressive LLMs might not automatically deduce indirect conclusions from separate pieces of knowledge learned during training: for the model to predict token B when the input token is A, it needs to have seen B following A within the same sequence in the training set, due to the next-token prediction objective and the model architecture. This also highlights the importance of ICL, data augmentation, or planning for LLMs with the currently popular causal transformer-based structures to solve complex reasoning tasks.
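
(One concrete reading of the data-augmentation point, sketched below; this is a generic mitigation, not a method proposed in the paper, and the template and names are made up.)

```python
def augment_with_reversals(pairs):
    """Given (a, b) pairs that a corpus states only in the 'a is b' direction,
    emit both the forward and the reversed statement so that each direction
    actually appears as a training sequence."""
    sequences = []
    for a, b in pairs:
        sequences.append(f"{a} is {b}.")      # the direction the corpus already contains
        sequences.append(f"{b} is {a}.")      # the reversed direction, added explicitly
    return sequences

# Hypothetical example, purely for illustration.
print(augment_with_reversals([("Alice's PhD advisor", "Bob")]))
# ["Alice's PhD advisor is Bob.", "Bob is Alice's PhD advisor."]
```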