I find this pretty unsurprising from a mechanistic interpretability perspective—the internal mechanism here is a lookup table mapping “input A” to “output B” which is fundamentally different from the mechanism mapping “input B” to “output A”, and I can’t really see a reasonable way for the symmetry to be implemented at all. I made a Twitter thread explaining this in more detail, which people may find interesting.
I found your thread insightful, so I hope you don’t mind me pasting it below to make it easier for other readers.
Neel Nanda ✅ @NeelNanda5 - Sep 24
The core intuition is that “When you see ‘A is’, output B” is implemented as an asymmetric look-up table, with an entry for A->B. B->A would be a separate entry
The key question to ask with a mystery like this about models is what algorithms are needed to get the correct answer, and how these can be implemented in transformer weights. These are what get reinforced when fine-tuning.
The two hard parts of “A is B” are recognising the input tokens A (out of all possible input tokens) and connecting this to the action to output tokens B (out of all possible output tokens). These are both hard! Further, the A → B look-up must happen on a single token position
Intuitively, the algorithm here has early attention heads attend to the prev token to create a previous token subspace on the Cruise token. Then an MLP neuron activates on “Current==Cruise & Prev==Tom” and outputs “Output=Mary”, “Next Output=Lee” and “Next Next Output=Pfeiffer”
“Output=Mary” directly connects to the unembed, and “Next Output=Lee” etc gets moved by late attention heads to subsequent tokens once Mary is output.
Crucially, there’s an asymmetry between “input A” and “output A”. Inputs are around at early layers, come from input embeddings, and touch the input weights of MLP neurons. Outputs are around more at late layers, compose with the unembedding, and come from output weights of MLPs
This is especially true with multi-token A and B. Detecting “Tom Cruise” is saying “the current token embedding is Cruise, and the prev token space says Tom”, while output “Tom Cruise” means to output the token Tom, and then a late attn head move “output Cruise” to the next token
Thus, when given a gradient signal to output B given “A is” it reinforces/creates a lookup “A → B”, but doesn’t create “B->A”, these are different lookups, in different parameters, and there’s no gradient signal from one to the other.
How can you fix this? Honestly, I can’t think of anything. I broadly think of this as LLMs working as intended. They have a 1 way flow from inputs to outputs, and a fundamental asymmetry between inputs and outputs. It’s wild to me to expect symmetry/flow reversing to be possible
Why is this surprising at all then? My guess is that symmetry is intuitive to us, and we’re used to LLMs being capable of surprising and impressive things, so it’s weird to see something seemingly basic missing.
LLMs are not human! Certain things are easy for us and not for them, and vice versa. My guess is that the key difference here is that when detecting/outputting specific tokens, the LLM has no notion of a variable that can take on arbitrary values—a direction has fixed meaning
A better analogy might be in-context learning, where LLMs CAN use “variables”. The text “Tom Cruise is the son of Mary Lee Pfeiffer. Mary Lee Pfeiffer is the mother of...” has the algorithmic solution “Attend to the subject of sentence 1 (Tom Cruise), and copy to the output”
Unsurprisingly, the model has no issue with reversing facts in context! Intuitively, when I remember a fact A is B, it’s closer to a mix of retrieving it into my “context window” and then doing in-context learning, rather than pure memorised recall.
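To make the asymmetry in the thread concrete, here is a deliberately crude sketch in plain Python (a toy data structure, not a claim about how real transformer weights look): the A → B entry is keyed on the input side and written on the output side, and nothing about it produces a B → A entry.

```python
# Toy sketch only: a lookup keyed on what the model reads (input side)
# whose value is what it writes (output side). This is an illustration
# of the asymmetry, not a description of actual model internals.
fact_lookup = {
    # entry learned from "Tom Cruise is the son of Mary Lee Pfeiffer"
    (("Tom", "Cruise"), "son of"): ("Mary", "Lee", "Pfeiffer"),
}

def recall(subject_tokens, relation):
    """Fires only if this exact (subject, relation) key was stored during training."""
    return fact_lookup.get((subject_tokens, relation))

print(recall(("Tom", "Cruise"), "son of"))              # ('Mary', 'Lee', 'Pfeiffer')
print(recall(("Mary", "Lee", "Pfeiffer"), "mother of")) # None: the B -> A entry was never created
```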
This seems like such an obvious question that I’m worried I’m missing something but… you phrase it as ‘A to B doesn’t cause B to A’, and people are using examples like ‘you can’t recite the alphabet backwards as easily as you can forwards’, and when I look at the list of ‘different training setups’, I see the very most obvious one not mentioned:
It’s possible that a different training setup would avoid the Reversal Curse. We try different setups in an effort to help the model generalize. Nothing helps. Specifically, we try:
Why wouldn’t simply ‘reversing the text during pretraining’ fix this for a causal decoder LLM? They only have a one-way flow because you set it up that way; there’s certainly nothing intrinsic about the ‘predict a token’ objective which constrains you to causal decoding—you can mask and predict any darn pattern of any darn data you please, it all is differentiable and backpropable and a loss to minimize. Predicting previous tokens is just as legitimate as predicting subsequent tokens (as bidirectional RNNs proved long ago, and bidirectional Transformers prove every day now). If the problem is that the dataset is chock-full of statements like “Who won the Fields Medal in 1990? Ed Witten” and not “For what did Ed Witten win a Fields Medal in 1990?”, then reversing the text would seem to reverse most of them and create the B->A versions. I mean, if I had spent as much time as a child singing the alphabet song backwards as I did singing it forwards, I expect that I would have little trouble going backwards in the alphabet as fluently as I do forwards!
(It’s unclear to me that this would even come at much of an expense in pretraining if you reversed half the inputs at random, because it’s still a powerful training signal.)
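For concreteness, here is a rough sketch of what ‘reverse half the inputs at random’ could look like in a pretraining data pipeline. The token-level reversal and the 50% rate are purely illustrative assumptions, not anything from the paper or from an existing library:

```python
import random

def maybe_reverse(token_ids, p=0.5, rng=random):
    """Data-augmentation sketch: reverse a pretraining sequence with probability p,
    so the same causal decoder also gets gradient signal for B -> A orderings."""
    return list(reversed(token_ids)) if rng.random() < p else list(token_ids)

# made-up token ids standing in for "Tom Cruise is the son of Mary Lee Pfeiffer"
example = [17, 203, 9, 4, 88, 12, 55, 41, 330]
augmented_batch = [maybe_reverse(example) for _ in range(4)]
```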
Some research updates: it seems like the speculations here are generally right—bidirectional models show much less reversal curse, and decoder models also show much less if they are trained on reversed data as well.
Bidirectional: “Are We Falling in a Middle-Intelligence Trap? An Analysis and Mitigation of the Reversal Curse”, Lv et al 2023 (GLM); “Not All Large Language Models (LLMs) Succumb to the ‘Reversal Curse’: A Comparative Study of Deductive Logical Reasoning in BERT and GPT Models”, Yang & Wang 2023
Sorta related: “Untying the Reversal Curse via Bidirectional Language Model Editing”, Ma et al 2023
Reverse training: “Mitigating Reversal Curse in Large Language Models via Semantic-aware Permutation Training”, Guo et al 2024; “Reverse Training to Nurse the Reversal Curse”, Golovneva et al 2024, which claims that data/compute-matched reverse training not only mitigates the reversal curse but also improves regular performance (not too surprising given that bidirectional models are usually better and that there are diminishing returns from predicting just one kind of masking, final-token masking, but still mildly surprising)
Very interesting. Yeah, I’m starting to doubt that the Reversal Curse is any sort of problem for LLMs at all; it’s probably trivial to fix.
Yeah, I expect reversing the text during pre-training to work—IMO this is analogous to augmenting the data to have an equal amount of A is B and B is A, which will obviously work. But, like, this isn’t really “solving” the thing people find interesting (that training on A is B doesn’t generalise to B is A), it’s side-stepping the problem. Maybe I’m just being picky though, I agree it should work.
OK, I think I see what the argument here actually is. You have 2 implicit arguments. First: ‘humans learn reversed relationships and are not fundamentally flawed; if NNs fundamentally learned as well as humans and were not fundamentally flawed and learned in a similar way, they would learn reversed relationships; NNs do not, therefore they do not learn as well as humans and are fundamentally flawed and do not learn in a similar way’. So a decoder LLM not doing reversed implies a fundamental flaw. Then the second argument is, ‘human brains do not learn using reversing; a NN learning as well as humans using reversing is still not learning like a human brain; therefore, it is fundamentally flawed’, and the conjunction is that either a LLM does worse than humans (and is flawed) or ‘cheats’ by using reversing (and is flawed), so it’s flawed.
Barring much stronger evidence about humans failing to reverse, I can accept the first argument for now.
But if reversing text during pretraining, or the near-strict equivalent of simply switching between mask-all-but-last and mask-all-but-first targets while doing prediction, fixed reversed relationships, that second implicit argument seems to not follow to me, because the premise is unsupported & doubtful.
We don’t know how humans learn, and so for all we know, human brains doing self-supervised learning could be reversing. If human brains maintained any sort of equivalent of ‘context’, then they could be doing any arbitrary masking & predictive loss over that context (and probably are). It’s unclear that this would even come at any additional compute cost. (I could see mixing in reversal targets as strictly superior to standard causal-decoder last-token-only—the diminishing returns to yet another last-token-only datapoint, the implicit data augmentation, the empirical success of bidirectional models like UL2 using more than just last-token-only, or the recent argument that strict causal decoding is intrinsically inferior to bidirectional losses because the former is datapoint-by-datapoint online learning and the latter allows full SGD minibatch learning.)
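A minimal sketch of the masking switch mentioned above, under a literal reading of ‘mask-all-but-last’ / ‘mask-all-but-first’ (illustrative only; a real setup would of course compute losses over whole sequences rather than single target tokens):

```python
def last_token_target(tokens):
    """'Mask-all-but-last': context is everything but the final token,
    and the prediction target is that final token (the standard causal setup)."""
    return tokens[:-1], tokens[-1]

def first_token_target(tokens):
    """'Mask-all-but-first': context is everything but the first token,
    and the prediction target is that first token (the reversed direction);
    flipping the sequence lets an ordinary causal decoder train on this."""
    return tokens[1:], tokens[0]

sentence = ["Tom", "Cruise", "is", "the", "son", "of", "Mary", "Lee", "Pfeiffer"]
ctx_fwd, tgt_fwd = last_token_target(sentence)   # predict "Pfeiffer" from the rest
ctx_bwd, tgt_bwd = first_token_target(sentence)  # predict "Tom" from the rest
```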
So if reversal training really does fix the reversal problems, all the reversal observations seem to show is that bidirectional models are smarter than unidirectional models (when bracketing out concerns like training compute-efficiency) and are more brain-like, neither of which seems too controversial to me; and they suggest a minor tweak to LLM training (to, say, preprocess half the data beforehand to reverse it), which makes this more an obscurely-framed capabilities result than anything particularly safety/interpretability-relevant.
I address the motivations for our Reversal Curse paper in a reply to your other comment.
My current (highly speculative) guess is that humans do learn one-directionally. We can’t easily recite poems backwards line-by-line or word-by-word or phoneme-by-phoneme. We can’t understand such reversed language either. It’s easy to count down (because we practice that) but harder to do the alphabet backwards (because we don’t practice it). Mostly when we memorize facts that are 2-way (unlike poems), we do some minimal amount of reflection/repetition that means both AB and BA are present. E.g. repeating to ourselves “casa, house, casa, house, etc...”. For facts we read passively in newspapers, it’s trickier to think about because we retain relatively little. But my guess is that most facts that we retain at all will be ones that appear in both orders, though that won’t be necessary for us learning them (because we can reflect on them ourselves). [If we don’t understand the semantics of what we are hearing at all, then we don’t memorize. E.g. Americans might hear a lot of Spanish on the streets but memorize basically nothing.]
We might also be using working memory to reconstruct reverse relations on the fly. E.g. reciting a poem backwards will consist of remembering chunks of it in the forward direction and then rearranging the chunks into reverse order. If that is correct, then a variation of CoT-prompting might work: first have the model recall any context in which the object appears, and then pick the answer out of that.
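A rough sketch of such a two-step prompt (the wording and structure here are purely illustrative, not something that has been tested):

```python
def recall_then_answer(entity, question):
    """Two-step 'recall then answer' prompt: first surface everything the model
    recalls about the entity, then answer the reversed question from that
    in-context material rather than from direct memorised lookup."""
    recall_prompt = f"List every fact or sentence you can recall that mentions {entity}."
    answer_prompt = f"Using only the facts listed above, answer: {question}"
    return recall_prompt, answer_prompt

step1, step2 = recall_then_answer(
    "Mary Lee Pfeiffer", "Who is Mary Lee Pfeiffer the mother of?"
)
```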
I agree that training backwards would likely fix this for a causal decoder LLM.
I would define the Reversal Curse as the phenomenon by which models cannot infer ‘B → A’ by training on examples of the form ‘A → B’. In our paper we weren’t so much trying to avoid the Reversal Curse, but rather trying to generate counterexamples to it. So when we wrote, “We try different setups in an effort to help the model generalize,” we were referring to setups in which a model infers ‘B → A’ without seeing any documents in which B precedes A, rather than ways to get around the Reversal Curse in practice.
Why wouldn’t simply ‘reversing the text during pretraining’ fix this for a causal decoder LLM?
I had basically the same idea here! I also expect that would work.
More generally, I think this kind of research (and also a lot of interpretability work) is interesting as a characterization and categorization of the workings and deficiencies of current systems and training processes, but not likely to be particularly useful for predicting trends or modelling systems in even the very near future (or the present, arguably… if you want an LLM to tell you about Mary Lee Pfeiffer or Ed Witten, just use Bing).
“I can’t really see a reasonable way for the symmetry to be implemented at all.”
Yeah, same.
Here’s an example, although it is not reasonable.
You could implement the embeddings in a vector database. If X1 and X2 are equivalent, embed them with an anti-collinear relationship, i.e. X1 = −X2, and implement the ‘is’ operator as multiplication by −1.
But this fails when there are three vectors that should be equivalent, and it is not very elegant to embed items that should be “equivalent” with an anti-collinear relationship.
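To spell out the failure: a minimal numpy sketch of the scheme (using the X1 = −X2 convention above) shows the pairwise case is symmetric for free, but three mutually “equivalent” vectors would be forced to zero.

```python
import numpy as np

x1 = np.array([0.3, -1.2, 0.8])   # embedding of X1
x2 = -x1                          # "X1 is X2" stored as anti-collinearity

def is_op(v):
    """The 'is' operator from the scheme above: multiply by -1."""
    return -v

print(np.allclose(is_op(x1), x2), np.allclose(is_op(x2), x1))  # True True: symmetric for a pair

# For three mutually equivalent items we would need x1 = -x2, x2 = -x3,
# and x1 = -x3 simultaneously; the first two give x1 = x3, so the third
# forces x1 = x2 = x3 = 0, collapsing the representation.
```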