This seems like such an obvious question that I’m worried I’m missing something but… you phrase it as ‘A to B doesn’t cause B to A’, and people are using examples like ‘you can’t recite the alphabet backwards as easily as you can forwards’, and when I look at the list of ‘different training setups’, I see the very most obvious one not mentioned:
It’s possible that a different training setup would avoid the Reversal Curse. We try different setups in an effort to help the model generalize. Nothing helps. Specifically, we try:
Why wouldn’t simply ‘reversing the text during pretraining’ fix this for a causal decoder LLM? They only have a one-way flow because you set it up that way; there’s certainly nothing intrinsic about the ‘predict a token’ task which constrains you to causal decoding: you can mask and predict any darn pattern of any darn data you please, it’s all differentiable and backpropable and a loss to minimize. Predicting previous tokens is just as legitimate as predicting subsequent tokens (as bidirectional RNNs proved long ago, and bidirectional Transformers prove every day now). If the problem is that the dataset is chock-full of statements like “Who won the Fields Medal in 1990? Ed Witten” and not “For what work did Ed Witten win a Fields Medal in 1990?”, then reversing the text would seem to reverse most of them and create the B->A versions. I mean, if I had spent as much time as a child singing the alphabet song backwards as I did singing it forwards, I expect that I would have little trouble going backwards in the alphabet as fluently as I do forwards!
(It’s unclear to me that this would even come at much of an expense in pretraining if you reversed half the inputs at random, because it’s still a powerful training signal.)
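(To make the suggested preprocessing concrete: a minimal sketch, in Python, of reversing half the pretraining sequences at random. This is my illustration, not anything from the paper; it assumes sequences are already tokenized into lists of integer IDs, and a real pipeline might reverse at the word or entity level rather than naively flipping token order.)

```python
import random

def reverse_half(token_sequences, p=0.5, seed=0):
    """Reverse the token order of a random fraction `p` of sequences.

    Next-token prediction on a reversed sequence is previous-token
    prediction on the original text, so 'A is B' statements also get
    trained in the 'B ... A' direction. Token-level flipping is the
    crudest variant; word-level reversal would keep the text more readable.
    """
    rng = random.Random(seed)
    return [seq[::-1] if rng.random() < p else seq for seq in token_sequences]

# Toy usage: one fake tokenized statement, reversed with probability 1.
corpus = [[101, 102, 103, 104, 105]]
print(reverse_half(corpus, p=1.0))  # [[105, 104, 103, 102, 101]]
```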
Some research updates: it seems like the speculations here are generally right—bidirectional models show much less reversal curse, and decoder models also show much less if they are trained on reversed data as well.
Bidirectional: “Are We Falling in a Middle-Intelligence Trap? An Analysis and Mitigation of the Reversal Curse”, Lv et al 2023 (GLM); “Not All Large Language Models (LLMs) Succumb to the “Reversal Curse”: A Comparative Study of Deductive Logical Reasoning in BERT and GPT Models”, Yang & Wang 2023
Sorta related: “Untying the Reversal Curse via Bidirectional Language Model Editing”, Ma et al 2023
Reverse training: “Mitigating Reversal Curse in Large Language Models via Semantic-aware Permutation Training”, Guo et al 2024; “Reverse Training to Nurse the Reversal Curse”, Golovneva et al 2024 - claims that data/compute-matched reverse training not only mitigates the reversal curse but also improves regular performance (which is not too surprising, given that bidirectional models are usually better and that there are diminishing returns from predicting just one kind of masking, final-token masking, but still mildly surprising)
Very interesting. Yeah, I’m starting to doubt that the Reversal Curse is any sort of problem for LLMs at all; it is probably trivial to fix.
Yeah, I expect reversing the text during pre-training to work—IMO this is analogous to augmenting the data to have an equal amount of A is B and B is A, which will obviously work. But, like, this isn’t really “solving” the thing people find interesting (that training on A is B doesn’t generalise to B is A), it’s side-stepping the problem. Maybe I’m just being picky though, I agree it should work.
OK, I think I see what the argument here actually is. You have 2 implicit arguments. First: ‘humans learn reversed relationships and are not fundamentally flawed; if NNs fundamentally learned as well as humans and were not fundamentally flawed and learned in a similar way, they would learn reversed relationships; NNs do not, therefore they do not learn as well as humans and are fundamentally flawed and do not learn in a similar way’. So a decoder LLM failing at reversed relationships implies a fundamental flaw. Then the second argument is, ‘human brains do not learn using reversing; a NN learning as well as humans using reversing is still not learning like a human brain; therefore, it is fundamentally flawed’, and the conjunction is that either a LLM does worse than humans (and is flawed) or ‘cheats’ by using reversing (and is flawed), so it’s flawed.
Barring much stronger evidence about humans failing to reverse, I can accept the first argument for now.
But if reversing text during pretraining, or the near-strict equivalent of simply switching between mask-all-but-last and mask-all-but-first targets while doing prediction, fixed reversed relationships, then that second implicit argument seems not to follow, because its premise is unsupported & doubtful.
We don’t know how humans learn, and so for all we know, human brains doing self-supervised learning could be reversing. If human brains maintain any sort of equivalent of ‘context’, then they can be doing any arbitrary masking & predictive loss over that context (and probably are). It’s unclear that this would even come at any additional compute cost. (I could see mixing in reversal targets as strictly superior to standard causal-decoder last-token-only prediction, for several reasons: diminishing returns to yet another last-token-only datapoint, the implicit data augmentation, the empirical success of bidirectional models like UL2 using more than just last-token-only objectives, and the recent argument that strict causal decoding is intrinsically inferior to bidirectional losses because the former is datapoint-by-datapoint online learning while the latter allows full SGD minibatch learning.)
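(A minimal sketch of what ‘mixing in reversal targets’ could look like for a causal decoder, assuming a PyTorch-style model that maps a batch of token IDs to next-token logits; the function name and signature are mine, not from any of the cited papers. Next-token loss on the flipped sequence is the ‘mask-all-but-first’ direction.)

```python
import torch
import torch.nn.functional as F

def forward_backward_loss(model, tokens):
    """Average next-token loss over a sequence and its reversal.

    Assumes `model(tokens)` returns logits of shape (batch, seq, vocab)
    for a causal decoder; next-token prediction on torch.flip(tokens)
    is equivalent to previous-token prediction on the original order.
    """
    def next_token_loss(x):
        logits = model(x)  # (batch, seq, vocab)
        return F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            x[:, 1:].reshape(-1),
        )

    return 0.5 * (next_token_loss(tokens)
                  + next_token_loss(torch.flip(tokens, dims=[1])))
```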
So if reversal training really does fix the reversal problems, all the reversal observations seem to show is that bidirectional models are smarter than unidirectional models (bracketing out concerns like training compute-efficiency) and are more brain-like, neither of which seems too controversial to me; and they suggest a minor tweak to LLM training (to, say, preprocess half the data beforehand to reverse it), which makes this more an obscurely-framed capabilities result than anything particularly safety/interpretability-relevant.
I address the motivations for our Reversal Curse paper in a reply to your other comment.
My current (highly speculative) guess is that humans do learn one-directionally. We can’t easily recite poems backwards line-by-line or word-by-word or phoneme-by-phoneme. We can’t understand such reversed language either. It’s easy to count down (because we practice that) but harder to do the alphabet backwards (because we don’t practice it). Mostly when we memorize facts that are 2-way (unlike poems), we do some minimal amount of reflection/repetition that means both AB and BA are present. E.g. repeating to ourselves “casa, house, casa, house, etc...”. For facts we read passively in newspapers, it’s trickier to think about because we retain relatively little. But my guess is that most facts that we retain at all will be ones that appear in both orders, though that won’t be necessary for us to learn them (because we can reflect on them ourselves).
[If we don’t understand the semantics of what we are hearing at all, then we don’t memorize it. E.g. Americans might hear a lot of Spanish on the streets but memorize basically nothing.]
We might also be using working memory to reconstruct reverse relations on the fly. E.g. reciting a poem backwards will consist of remembering chunks of it in the forward direction and then rearranging the chunks into reverse order.
If that is correct, then a variation of CoT-prompting might work: first have the model recall any context in which the object appears, and then pick the answer out of that.
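(A rough sketch of that two-step ‘recall then answer’ prompting, purely illustrative: complete() is a hypothetical stand-in for whatever LLM completion API is available, the prompts are mine, and whether the recall step actually surfaces the reversed fact is exactly the open question.)

```python
def recall_then_answer(entity, question, complete):
    """Two-step prompting: dump recalled context about `entity`,
    then answer the reversed question from that context only.
    `complete` is a hypothetical callable: prompt string -> completion string.
    """
    # Step 1: elicit forward-direction facts that mention the entity.
    recalled = complete(
        f"List every fact you can recall that mentions {entity}."
    )
    # Step 2: do the reversal in-context rather than in the weights.
    return complete(
        f"Context:\n{recalled}\n\nUsing only the context above, answer: {question}"
    )

# e.g. recall_then_answer("Mary Lee Pfeiffer", "Who is Mary Lee Pfeiffer's son?", complete)
```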
I agree that training backwards would likely fix this for a causal decoder LLM.
I would define the Reversal Curse as the phenomenon by which models cannot infer ‘B → A’ by training on examples of the form ‘A → B’. In our paper we weren’t so much trying to avoid the Reversal Curse, but rather trying to generate counterexamples to it. So when we wrote, “We try different setups in an effort to help the model generalize,” we were referring to setups in which a model infers ‘B → A’ without seeing any documents in which B precedes A, rather than ways to get around the Reversal Curse in practice.
Why wouldn’t simply ‘reversing the text during pretraining’ fix this for a causal decoder LLM?

I had basically the same idea here! I also expect that would work.
More generally, I think this kind of research (and also a lot of interpretability work) is interesting as a characterization and categorization of the workings and deficiencies of current systems and training processes, but not likely to be particularly useful for predicting trends or modelling systems in even the very near future (or the present, arguably… if you want an LLM to tell you about Mary Lee Pfeiffer or Ed Witten, just use Bing).