Some research updates: it seems like the speculations here are generally right. Bidirectional models show much less reversal curse, and decoder models show much less if they are also trained on reversed data.
Bidirectional: “Are We Falling in a Middle-Intelligence Trap? An Analysis and Mitigation of the Reversal Curse”, Lv et al 2023 (GLM); “Not All Large Language Models (LLMs) Succumb to the ‘Reversal Curse’: A Comparative Study of Deductive Logical Reasoning in BERT and GPT Models”, Yang & Wang 2023
Sorta related: “Untying the Reversal Curse via Bidirectional Language Model Editing”, Ma et al 2023
Reverse training: “Mitigating Reversal Curse in Large Language Models via Semantic-aware Permutation Training”, Guo et al 2024; “Reverse Training to Nurse the Reversal Curse”, Golovneva et al 2024 - claims that data/compute-matched reverse training not only mitigates the reversal curse but also improves regular performance (not too surprising given that bidirectional models are usually better and that there are diminishing returns to training on only one kind of masking, final-token prediction, but still mildly surprising). A minimal sketch of what reversed-data training looks like is below.
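For concreteness, here is a minimal sketch of the data-augmentation side of reverse training, assuming simple word-level reversal mixed in with the forward data; the function names, the 50/50 mixing ratio, and the tokenization granularity are illustrative assumptions, not the exact recipe from the papers above (which also consider entity-preserving and segment-level reversal).

```python
# Sketch of word-level reverse training data augmentation (hypothetical
# simplification of the reverse-training setup; mixing ratio and word-level
# granularity are assumptions for illustration).
import random

def reverse_words(text: str) -> str:
    """Reverse the word order of one training example."""
    return " ".join(reversed(text.split()))

def build_training_mix(corpus: list[str],
                       reversed_fraction: float = 0.5,
                       seed: int = 0) -> list[str]:
    """Return a data-matched mix of forward and word-reversed examples,
    so the decoder sees both 'A is B' and 'B is A'-style orderings."""
    rng = random.Random(seed)
    mixed = []
    for doc in corpus:
        if rng.random() < reversed_fraction:
            mixed.append(reverse_words(doc))
        else:
            mixed.append(doc)
    return mixed

if __name__ == "__main__":
    corpus = ["Tom Cruise's mother is Mary Lee Pfeiffer."]
    print(build_training_mix(corpus, reversed_fraction=1.0))
    # ["Pfeiffer. Lee Mary is mother Cruise's Tom"]
```

The point of keeping the mix data/compute-matched is that any gain over the forward-only baseline comes from the reversed ordering itself, not from extra tokens.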
Very interesting. Yeah, I’m starting to doubt that the Reversal Curse is any sort of problem for LLMs at all; it’s probably trivial to fix.