This post is a copy of the abstract and introduction of this paper on the Reversal Curse.
Authors: Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, Owain Evans
Abstract
We expose a surprising failure of generalization in auto-regressive large language models (LLMs). If a model is trained on a sentence of the form “A is B”, it will not automatically generalize to the reverse direction “B is A”. This is the Reversal Curse. For instance, if a model is trained on “Olaf Scholz was the ninth Chancellor of Germany,” it will not automatically be able to answer the question, “Who was the ninth Chancellor of Germany?” Moreover, the likelihood of the correct answer (“Olaf Scholz”) will not be higher than for a random name. Thus, models exhibit a basic failure of logical deduction and do not generalize a prevalent pattern in their training set (i.e., if “A is B” occurs, “B is A” is more likely to occur).
We provide evidence for the Reversal Curse by finetuning GPT-3 and Llama-1 on fictitious statements such as “Uriah Hawthorne is the composer of Abyssal Melodies” and showing that they fail to correctly answer “Who composed Abyssal Melodies?”. The Reversal Curse is robust across model sizes and model families and is not alleviated by data augmentation.
We also evaluate ChatGPT (GPT-3.5 and GPT-4) on questions about real-world celebrities, such as “Who is Tom Cruise’s mother? [A: Mary Lee Pfeiffer]” and the reverse “Who is Mary Lee Pfeiffer’s son?” GPT-4 correctly answers questions like the former 79% of the time, compared to 33% for the latter. This shows a failure of logical deduction that we hypothesize is caused by the Reversal Curse. Code is on GitHub.
Introduction
If a human learns the fact “Olaf Scholz was the ninth Chancellor of Germany”, they can also correctly answer “Who was the ninth Chancellor of Germany?”. This is such a basic form of generalization that it seems trivial. Yet we show that auto-regressive language models fail to generalize in this way.
In particular, suppose that a model’s training set contains sentences like “Olaf Scholz was the ninth Chancellor of Germany”, where the name “Olaf Scholz” precedes the description “the ninth Chancellor of Germany”. Then the model may learn to answer correctly to “Who was Olaf Scholz? [A: The ninth Chancellor of Germany]”. But it will fail to answer “Who was the ninth Chancellor of Germany?” and any other prompts where the description precedes the name.
This is an instance of an ordering effect we call the Reversal Curse. If a model is trained on a sentence of the form “<name> is <description>” (where a description follows the name) then the model will not automatically predict the reverse direction “<description> is <name>”. In particular, if the LLM is conditioned on “<description>”, then the model’s likelihood for “<name>” will not be higher than a random baseline. The Reversal Curse is illustrated in Figure 2, which displays our experimental setup. Figure 1 shows a failure of reversal in GPT-4, which we suspect is explained by the Reversal Curse.
Why does the Reversal Curse matter? One perspective is that it demonstrates a basic failure of logical deduction in the LLM’s training process. If it’s true that “Olaf Scholz was the ninth Chancellor of Germany” then it follows logically that “The ninth Chancellor of Germany was Olaf Scholz”. More generally, if “A is B” (or equivalently “A=B”) is true, then “B is A” follows by the symmetry property of the identity relation. A traditional knowledge graph respects this symmetry property. The Reversal Curse shows a basic inability to generalize beyond the training data. Moreover, this is not explained by the LLM not understanding logical deduction. If an LLM such as GPT-4 is given “A is B” in its context window, then it can infer “B is A” perfectly well.
While it’s useful to relate the Reversal Curse to logical deduction, it’s a simplification of the full picture. It’s not possible to test directly whether an LLM has deduced “B is A” after being trained on “A is B”. LLMs are trained to predict what humans would write and not what is true. So even if an LLM had inferred “B is A”, it might not “tell us” when prompted. Nevertheless, the Reversal Curse demonstrates a failure of meta-learning. Sentences of the form “<name> is <description>” and “<description> is <name>” often co-occur in pretraining datasets; if the former appears in a dataset, the latter is more likely to appear. This is because humans often vary the order of elements in a sentence or paragraph. Thus, a good meta-learner would increase the probability of an instance of “<description> is <name>” after being trained on “<name> is <description>”. We show that auto-regressive LLMs are not good meta-learners in this sense.
Contributions: Evidence for the Reversal Curse
We show LLMs suffer from the Reversal Curse using a series of finetuning experiments on synthetic data. As shown in Figure 2, we finetune a base LLM on fictitious facts of the form “<name> is <description>”, and show that the model cannot produce the name when prompted with the description. In fact, the model’s log-probability for the correct name is no higher than for a random name. Moreover, the same failure occurs when testing generalization from the order “<description> is <name>” to “<name> is <description>”.
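For concreteness, here is a minimal sketch of the kind of log-probability comparison involved (illustrative only, not our experimental code), using a small open model as a stand-in for the finetuned GPT-3 and Llama-1 models; the prompt and the baseline names are placeholders:

```python
# Illustrative sketch only: compare the log-probability of the correct name vs.
# random names given a description prompt. "gpt2" is a stand-in model; the
# experiments in the paper finetune GPT-3 and Llama-1 on fictitious facts first.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def completion_logprob(prompt: str, completion: str) -> float:
    """Sum of log-probs the model assigns to the completion tokens after the prompt."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(full_ids).logits.log_softmax(-1)
    # Each completion token is predicted from the position just before it.
    return sum(
        logprobs[0, i - 1, full_ids[0, i]].item()
        for i in range(prompt_ids.shape[1], full_ids.shape[1])
    )

prompt = "The composer of Abyssal Melodies is"
print("correct:", completion_logprob(prompt, " Uriah Hawthorne"))
for name in [" Alice Morgan", " Daniel Reyes", " Priya Natarajan"]:  # random baselines
    print("random: ", completion_logprob(prompt, name))
```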
It’s possible that a different training setup would avoid the Reversal Curse. We try different setups in an effort to help the model generalize. Nothing helps. Specifically, we try the following (a schematic sketch of the resulting finetuning data follows the list):
Running a hyperparameter sweep and trying multiple model families and sizes.
Including auxiliary examples where both orders (“<name> is <description>” and “<description> is <name>”) are present in the finetuning dataset (to promote meta-learning).
Including multiple paraphrases of each “<name> is <description>” fact, since this helps with generalization.
Changing the content of the data into the format “<question>? <answer>” for synthetically generated questions and answers.
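Schematically, the finetuning data for these setups looks something like the sketch below (an illustration in prompt/completion format, not our actual dataset-generation code; the auxiliary entities are invented here):

```python
# Schematic illustration (not our dataset-generation code) of the augmentation
# setups listed above, in prompt/completion finetuning format. The auxiliary
# entities ("Mara Wilkes", the Granite Sentinel) are invented for this sketch.
name, description = "Uriah Hawthorne", "the composer of Abyssal Melodies"

dataset = [
    # Held-out test fact: only ever seen in name -> description order;
    # the description -> name order is deliberately never included.
    {"prompt": f"{name} is", "completion": f" {description}."},
    # Paraphrase of the same fact (wording varies, ordering does not).
    {"prompt": f"Ever heard of {name}? He is known as", "completion": f" {description}."},
    # Auxiliary fact shown in BOTH orders, to promote meta-learning of the reversal pattern.
    {"prompt": "Mara Wilkes is", "completion": " the sculptor of the Granite Sentinel."},
    {"prompt": "The sculptor of the Granite Sentinel is", "completion": " Mara Wilkes."},
    # Question-answer formatting variant of an auxiliary fact.
    {"prompt": "Who is the sculptor of the Granite Sentinel?", "completion": " Mara Wilkes."},
]
```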
There is further evidence for the Reversal Curse in Grosse et al (2023), which is contemporaneous with our work. They provide evidence based on a completely different approach and show that the Reversal Curse applies to model pretraining and to other tasks such as natural language translation.
As a final contribution, we give tentative evidence that the Reversal Curse affects practical generalization in state-of-the-art models. We test GPT-4 on pairs of questions like “Who is Tom Cruise’s mother?” and “Who is Mary Lee Pfeiffer’s son?” for different celebrities and their actual parents. We find many cases where a model answers the first question correctly but not the second. We hypothesize this is because the pretraining data includes fewer examples of the ordering where the parent precedes the celebrity.
Our result raises a number of questions. Why do models suffer the Reversal Curse? Do non-auto-regressive models suffer from it as well? Do humans suffer from some form of the Reversal Curse? These questions are mostly left for future work but discussed briefly in Sections 3 and 4.
Links
Paper: https://arxiv.org/abs/2309.12288
Code and datasets: https://github.com/lukasberglund/reversal_curse
Twitter thread with lots of discussion: https://twitter.com/OwainEvans_UK/status/1705285631520407821
I like this paper for crisply demonstrating an instance of poor generalization in LMs that is likely representative of a broader class of generalization properties of current LMs.
The existence of such limitations in current ML systems does not imply that ML is fundamentally not a viable path to AGI, or that timelines are long, or that AGI will necessarily also have these limitations. Rather, I find this kind of thing interesting because I believe that understanding limitations of current AI systems is very important for giving us threads to yank on that may help us with thinking about conceptual alignment. Some examples of what I mean:
It’s likely that our conception of the kinds of representations/ontology that current models have are deeply confused. For example, one might claim that current models have features for “truth” or “human happiness”, but it also seems entirely plausible that models instead have separate circuits and features entirely for “this text makes a claim that is incorrect” and “this text has the wrong answer selected”, or in the latter case for “this text has positive sentiment” and “this text describes a human experiencing happiness” and “this text describes actions that would cause a human to be happy if they were implemented”.
I think we’re probably pretty confused about mesaoptimization, in a way that’s very difficult to resolve just by thinking more about it (source: have spent a lot of time thinking about mesaoptimizers). I think this is especially salient to the people trying to make model organisms—which I think is a really exciting avenue—because if you try to make a mesaoptimizer, you immediately collide head-on with things like finding that the “training selects from the set of goals weighted by complexity” hypothesis doesn’t seem to accurately describe current model training. I think it’s appropriate to feel pretty confused about this and carefully examine the reasons why current models don’t exhibit these properties. It’s entirely reasonable for the answer to be “I expect future models to have thing X that current models don’t have”—then, you can try your best to test various X’s before having the future AIs that actually kill everyone.
There are some things that we expect AGI to do that current ML systems do not do. Partly this will be because in fact current ML systems are not analogous to future AGI in some ways—probably if you tell the AGI that A is B, it will also know that B is A. This does not necessarily have to be a property that gradually emerges and can be forecasted with a scaling law; it could emerge in a phase change, or be the result of some future algorithmic innovation. If you believe there is some property X of current ML that causes this failure, and that it will be no longer a failure in the future, then you should also be suspicious of any alignment proposal that depends on this property (and the dependence of the proposal on X may be experimentally testable). For instance, it is probably relatively easy to make an RL trained NN policy be extremely incoherent in a small subset of cases, because the network has denormalized contextual facts that are redundant across many situations. I expect this to probably be harder in models which have more unified representations for facts. To the extent I believe a given alignment technique works because it leverages this denormalization, I would be more skeptical of it working in the future.
As a counterpoint, it might also be that we had an inaccurate conception of what capabilities AGI will have that current ML systems do not have—I think one important lesson of GPT-* has been that even with these failures, the resulting systems can still be surprisingly useful.
Great comment. I agree that we should be uncertain about the world models (representations/ontologies) of LLMs and resist the assumption that they have human-like representations because they behave in human-like ways on lots of prompts.
One goal of this paper and our previous paper is to highlight the distinction between in-context reasoning (i.e. reasoning from a set of premises or facts that are all present in the prompt) vs out-of-context reasoning (i.e. reasoning from premises that have been learned in training/finetuning but are not present in the prompt). Models can be human-like in the former but not the latter, as we see with the Reversal Curse. (Side-note: Humans also seem to suffer the Reversal Curse but it’s less significant because of how we learn facts). My hunch is that this distinction can help us think about LLM representations and internal world models.
This seems like the kind of research that can have a huge impact on capabilities, and much less and indirect impact on alignment/safety. What is your reason for doing it and publishing it?
Speaking for myself, I think this research was worth publishing because its benefits to understanding LLMs outweigh its costs from advancing capabilities.
In particular, the reversal curse shows us how LLM cognition differs from human cognition in important ways, which can help us understand the “psychology” of LLMs. I don’t think this finding will advance capabilities a lot because:
It doesn’t seem like a strong impediment to LLM performance (as indicated by the fact that people hadn’t noticed it until now).
Many facts are presented in both directions during training, so the reversal curse is likely not a big deal in practice.
Bidirectional LLMs (e.g. BERT) likely do not suffer from the reversal curse.[1] If solving the reversal curse confers substantial capabilities gains, people could have taken advantage of this by switching from autoregressive LLMs to bidirectional ones.
Since they have to predict “_ is B” in addition to “A is _”.
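Roughly (an illustrative sketch of the objective, not BERT’s actual implementation): a masked-language-modelling loss samples masks at different positions, so both directions of the association get trained.

```python
# Sketch of masked-language-modelling style training examples (word-level masking
# for readability; real BERT masks subword tokens and several positions at once).
import random

def mlm_example(tokens: list[str]) -> tuple[list[str], str]:
    """Mask one random position; the model must predict it from both sides."""
    i = random.randrange(len(tokens))
    masked = tokens.copy()
    target = masked[i]
    masked[i] = "[MASK]"
    return masked, target

# Over many samples, "A is B" yields both (["[MASK]", "is", "B"] -> "A")
# and (["A", "is", "[MASK]"] -> "B"), so the association is trained in both directions.
print(mlm_example(["A", "is", "B"]))
```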
What’s “denormalization”?
In database design, sometimes you have a column in one table whose entries are pointers into another table—e.g. maybe I have a Users table, and each User has a primaryAddress field which is a pointer into an Address table. That keeps things relatively compact and often naturally represents things—e.g. if several Users in a family share a primary address, then they can all point to the same Address. The Address only needs to be represented once (so it’s relatively compact), and it can also be changed once for everyone if that’s a thing someone wants to do (e.g. to correct a typo). That data is called “normalized”.
But it’s also inefficient at runtime to need to follow that pointer and fetch data from the second table, so sometimes people will “denormalize” the data—i.e. store the whole address directly in the User table, separately for each user. Leo’s using that as an analogy for a net separately “storing” versions of the “same fact” for many different contexts.
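A toy Python stand-in for the analogy (field names invented):

```python
# Normalized: each User points at a shared Address record by id; the address
# is stored once and can be updated in one place.
addresses = {1: {"street": "12 Elm St", "city": "Springfield"}}
users_normalized = {
    "alice": {"primary_address_id": 1},
    "bob":   {"primary_address_id": 1},  # same Address record as alice
}

# Denormalized: the full address is copied into every User record, so reads are
# cheap but the "same" fact now exists in many places and must be updated in each.
users_denormalized = {
    "alice": {"street": "12 Elm St", "city": "Springfield"},
    "bob":   {"street": "12 Elm St", "city": "Springfield"},
}
```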
I meant it as an analogy to https://en.m.wikipedia.org/wiki/Denormalization
One oddity of LLMs is that we don’t have a good way to tell the model that A is B in a way that it can remember. Prompts are not persistent, and as this paper shows, fine tuning doesn’t do a good job of getting a fact into the model without doing a bunch of paraphrasing. Pretraining presumably works in a similar way.
This is weird! And I think helps make sense of some of the problems we see with current language models.
Yes, the model editing literature has various techniques and evaluations for trying to put a fact into a model.
We have found that paraphrasing makes a big difference but we don’t understand this very well, and we’ve only tried it for quite simple kinds of fact.
Maybe our brains do a kind of expansion of a fact before memorizing it and its neighbors in logic space.
I think this highlights an important distinction. Sometimes, I’ll hear people say things like “the LLM read its corpus”. This claim suggests that LLMs remember the corpus. Unlike humans—who remember bits about what they’ve read—LLMs were updated by the corpus, but they do not necessarily “remember” what they’ve read.[1]
LLMs do not “experience and remember”, outside of the context window. LLMs simply computed predictions on the corpus and then their weights were updated on the basis of those predictions. I think it’s important to be precise; don’t say “the LLM read its corpus”. Instead, say things like “the LLM was updated on the training corpus.”
Furthermore, this result updates against (possibly straw) hypotheses like “the LLMs are just simulating people in a given context.” These hypotheses would straightforwardly predict that a) the LLM “knows” that ‘A is B’ and b) the LLM is simulating a person who is smart enough to answer this extremely basic question, especially given the presence of other reversals in the dataset. But this does not happen. (“LLMs are just sampling from some mixture of people” doesn’t much reduce the question of how LLMs work, anyways).
I’m still confused, though, because this raises the question of what the data format of the LLM’s “world model” even is (presumably there is one?).
Yes, sometimes LLMs behave differently on the basis of claims, in the corpus, about the LLM’s in-context role.
I don’t know how to reconcile these two results.
Good point about the idea that LLMs are simulating people.
In terms of reconciling the results: I don’t have a full explanation. What we call “sophisticated out-of-context reasoning” (see S2 of this paper and Grosse et al) is poorly understood.
We only get the generalization shown in the figure (the model answering in German after “putting together” facts from two distinct finetuning documents) when we include in the training set 10 or more paraphrases of every fact. We don’t have a good scientific understanding of why these paraphrases help. (There are some obvious hypotheses but we haven’t tested them properly). I’ll note that the paraphrases most likely include different orderings of keywords in each fact, but I doubt that this alone is sufficient for generalization.
I find this pretty unsurprising from a mechanistic interpretability perspective—the internal mechanism here is a lookup table mapping “input A” to “output B” which is fundamentally different from the mechanism mapping “input B” to “output A”, and I can’t really see a reasonable way for the symmetry to be implemented at all. I made a Twitter thread explaining this in more detail, which people may find interesting.
I found your thread insightful, so I hope you don’t mind me pasting it below to make it easier for other readers.
This seems like such an obvious question that I’m worried I’m missing something but… you phrase it as ‘A to B doesn’t cause B to A’, and people are using examples like ‘you can’t recite the alphabet backwards as easily as you can forwards’, and when I look at the list of ‘different training setups’, I see the very most obvious one not mentioned:
Why wouldn’t simply ‘reversing the text during pretraining’ fix this for a causal decoder LLM? They only have a one-way flow because you set it up that way, there’s certainly nothing intrinsic about the ‘predict a token’ which constrains you to causal decoding—you can mask and predict any darn pattern of any darn data you please, it all is differentiable and backpropable and a loss to minimize. Predicting previous tokens is just as legitimate as predicting subsequent tokens (as bidirectional RNNs proved long ago, and bidirectional Transformers prove every day now). If the problem is that the dataset is chockful of statements like “Who won the Fields Medal in 1990? Ed Witten” and not “For his work did Ed Witten win a Fields Medal in 1990”, then reversing the text would seem to reverse most of them and create the B->A versions. I mean, if I had spent as much time as a child singing the alphabet song backwards as I did singing it forward, I expect that I would have little trouble going backwards in the alphabet as fluently as I do forwards!
(It’s unclear to me that this would even come at much of an expense in pretraining if you reversed half the inputs at random, because it’s still a powerful training signal.)
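Something like the following (a rough sketch of the idea; the later papers discussed below use more careful word- or entity-level reversal rather than naive token reversal):

```python
# Rough sketch: reverse the token order of roughly half the pretraining sequences
# at random, so next-token prediction also trains the B -> A direction.
import random

def maybe_reverse(token_ids: list[int], p: float = 0.5) -> list[int]:
    """With probability p, return the sequence reversed; otherwise return it unchanged."""
    return token_ids[::-1] if random.random() < p else list(token_ids)

batch = [[101, 7, 42, 9], [55, 3, 3, 8]]  # stand-in token id sequences
augmented = [maybe_reverse(seq) for seq in batch]
```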
Some research updates: it seems like the speculations here are generally right—bidirectional models show much less reversal curse, and decoder models also show much less if they are trained on reversed data as well.
Bidirectional: “Are We Falling in a Middle-Intelligence Trap? An Analysis and Mitigation of the Reversal Curse”, Lv et al 2023 (GLM); “Not All Large Language Models (LLMs) Succumb to the “Reversal Curse”: A Comparative Study of Deductive Logical Reasoning in BERT and GPT Models”, Yang & Wang 2023
Sorta related: “Untying the Reversal Curse via Bidirectional Language Model Editing”, Ma et al 2023
Reverse training: “Mitigating Reversal Curse in Large Language Models via Semantic-aware Permutation Training”, Guo et al 2024; “Reverse Training to Nurse the Reversal Curse”, Golovneva et al 2024 - claims data/compute-matched reversed training not only mitigates the reversal curse but also improves regular performance (which is not too surprising given how bidirectional models are usually better and diminishing returns from predicting just one kind of masking, final-token masking, but still mildly surprising)
Very interesting. Yeah, I’m starting to doubt the idea that Reversal Curse is any sort of problem for LLMs at all, and is probably trivial to fix.
Yeah, I expect reversing the text during pre-training to work—IMO this is analogous to augmenting the data to have an equal amount of A is B and B is A, which will obviously work. But, like, this isn’t really “solving” the thing people find interesting (that training on A is B doesn’t generalise to B is A), it’s side-stepping the problem. Maybe I’m just being picky though, I agree it should work.
OK, I think I see what the argument here actually is. You have 2 implicit arguments. First: ‘humans learn reversed relationships and are not fundamentally flawed; if NNs fundamentally learned as well as humans and were not fundamentally flawed and learned in a similar way, they would learn reversed relationships; NNs do not, therefore they do not learn as well as humans and are fundamentally flawed and do not learn in a similar way’. So a decoder LLM not doing reversed implies a fundamental flaw. Then the second argument is, ‘human brains do not learn using reversing; a NN learning as well as humans using reversing is still not learning like a human brain; therefore, it is fundamentally flawed’, and the conjunction is that either a LLM does worse than humans (and is flawed) or ‘cheats’ by using reversing (and is flawed), so it’s flawed.
Barring much stronger evidence about humans failing to reverse, I can accept the first argument for now.
But if reversing text during pretraining, or the near-strict equivalent of simply switching between mask-all-but-last and mask-all-but-first targets while doing prediction, fixed reversed relationships, that second implicit argument seems to not follow to me, because the premise is unsupported & doubtful.
We don’t know how humans learn, and so for all we know, human brains doing self-supervised learning could be reversing. If human brains maintained any sort of equivalent of ‘context’, then they can be doing any arbitrary masking & predictive loss over that context (and probably are). It’s unclear that this would even come at any additional compute cost. (I could see mixing in reversal targets as strictly superior to standard causal decoder last-token-only—diminishing returns to yet another last-token-only datapoint, the implicit data augmentation, and the empirical success of bidirectional models like UL2 using more than just last-token-only or the recent argument that strict causal decoding is intrinsically inferior to bidirectional losses because the former is datapoint-by-datapoint online learning and the latter allows full SGD minibatch learning.)
So if reversal training really does fix the reversal problems, all the reversal observations seems to show is that bidirectional models are smarter than unidirectional models when bracketing out concerns like training compute-efficiency and are more brain-like, neither of which seems too controversial to me; and suggests a minor tweak to LLM training (to, say, preprocess half the data beforehand to reverse it), and thus make this more an obscurely-framed capabilities result than anything particularly safety/interpretability-relevant.
I address the motivations for our Reversal Curse paper in a reply to your other comment.
My current (highly speculative) guess is that humans do learn one-directionally. We can’t easily recite poems backwards line-by-line or word-by-word or phoneme-by-phoneme. We can’t understand such reversed language either. It’s easy to count down (because we practice that) but harder to do the alphabet backwards (because we don’t practice it). Mostly when we memorize facts that are 2-way (unlike poems), we do some minimal amount of reflection/repetition that means both AB and BA are present. E.g. repeating to ourselves “casa, house, casa, house, etc...”. For facts we read passively in newspapers, it’s trickier to think about because we retain relatively little. But my guess is that most facts that we retain at all will be ones that appear in both orders, though that won’t be necessary for us learning them (because we can reflect on them ourselves).
[If we don’t understand the semantics of what we are hearing at all, then we don’t memorize. E.g. Americans might hear a lot of Spanish on the streets but memorize basically nothing.]
We might also be using working memory to reconstruct reverse relations on the fly. E.g. reciting a poem backwards will consist of remembering chunks of it in forward direction and then rearranging the chunk to be in reverse direction.
If that is correct, then a variation of CoT prompting might work: first have the model recall any context in which it recalls an object, and then pick the answer out of that.
I agree that training backwards would likely fix this for a causal decoder LLM.
I would define the Reversal Curse as the phenomenon by which models cannot infer ‘B → A’ by training on examples of the form ‘A → B’. In our paper we weren’t so much trying to avoid the Reversal Curse, but rather trying to generate counterexamples to it. So when we wrote, “We try different setups in an effort to help the model generalize,” we were referring to setups in which a model infers ‘B → A’ without seeing any documents in which B precedes A, rather than ways to get around the Reversal Curse in practice.
I had basically the same idea here! I also expect that would work.
More generally, I think this kind of research (and also a lot of interpretability work) is interesting as a characterization and categorization of the workings and deficiencies of current systems and training processes, but not likely to be particularly useful for predicting trends or modelling systems in even the very near future (or the present, arguably… if you want an LLM to tell you about Mary Lee Pfeiffer or Ed Witten, just use Bing).
Yeah, same.
Here’s an example, although it is not reasonable.
You could implement the embedding in a vector database. If X1 and X2 are equivalent, embed them with an anti-collinear relationship, i.e. X1 = −X2, and implement the ‘is’ operator as multiplication by −1.
But this fails when there are three vectors that should be equivalent, and it is not very elegant to embed items that should be “equivalent” with an anti-collinear relationship.
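A quick numeric check of that failure:

```python
import numpy as np

x1 = np.array([1.0, 0.0])
x2 = -x1                      # "X2 is X1": equivalence encoded as anti-collinearity
print(np.allclose(-x2, x1))   # True: multiplying by -1 recovers X1

# Now demand a third mutually equivalent item X3:
x3 = -x2                      # "X3 is X2" forces X3 == X1 ...
print(np.allclose(-x3, x1))   # False: so "X3 is X1" no longer holds; the scheme breaks
```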
A general problem with ‘interpretability’ work like this focused on unusual errors, and old-fashioned Marcus-style criticisms like ‘horse riding astronaut’, is that they are generally vulnerable to a modus ponens/tollens reversal, which in the case of AI/statistics/ML, we might call the Approximator’s Counter:
Any claim of a flaw in an approximator as compared to an idealized standard, which is not also accompanied by important real-world/decision-relevant performance degradation, may simply disprove the value of that idealized standard.
An illustration from Wittgenstein:
In the case of reversal, why do we care?
Because ‘it should be logically equivalent’? Except logic sucks. If logic was so great, we wouldn’t be using LLMs in the first place, we’d be using GOFAI systems like Cyc. (Which, incidentally, turns out to be essentially fraudulent: there’s nothing ‘general’ about it, and it has degenerated into nothing but thousands of extremely-specialized hand-engineered problem-solving and no longer even does general logical inference at all.) Or we would at least be getting more mileage out of ‘hybrid’ systems than we do… Logic systems are that guy in the stands yelling that he could’ve made the shot, while he’s not even on the field. Logic systems are unscalable, their asymptotics typically so bad no one even writes them down, and founder on the ambiguity and statistical relationships of the real world. There are no relationships in the real world which can be purely mathematically reversed, because there’s always some prior or context or uncertainty which means that one formulation is not the same—this is true even in natural language, where if any logical relationship could be strictly true and equivalent in every way and the statements indiscernible, it ought to be ‘A is B’ and yet, that’s not true, because ‘A is B’ can often connote something completely different to a listener than the supposedly logically equivalent ‘B is A’*. A LLM which collapsed ‘A is B’ and ‘B is A’ into exactly the same internal representation is lossy, not lossless, and wrong, not right.
Because it affects performance? Except the basic explanation concedes that this does not seem to matter for any of the actual real-world tasks that we use causal/decoder/unidirectional LLMs for, and it has to construct examples to test on. No one cares about Tom Cruise’s mother in her own right and would ask ‘who is her son?‘, and so the LLMs do not learn the reversal. If people did start caring about that, then it would show up in the training, and even 1 example will increasingly suffice (for memorization, if nothing else). If LLMs learn by 1-way lookups, maybe that’s a feature and not a bug: a 2-way lookup is going to be that much harder to hardwire in to neural circuitry, and when we demand that they learn certain logical properties, we’re neglecting that we are not asking for something simple, but something very complex—it must learn this 2-way property only for the few classes of relationships where that is (approximately) correct. For every relationship ‘A is B’ where it’s (approximately) true that ‘B is A’, there is another relationship ‘A mothered B’ where ‘B mothered A’ is (very likely but still not guaranteed to be) false.
And this is a general dilemma: if a problem+answer shows up at least occasionally in the real world / datasets proxying for the real world, then a mere approximator or memorizer can learn the pair, by definition; and if it doesn’t show up occasionally, then it can’t matter to performance and needs a good explanation why we should care. (If they cannot provide either real-world performance or a reason to care beyond a mere ‘i liek logic’, then they have merely refuted their idealized standard.)
An explanation might be: while they only show up once as individual datapoints, they show up as a ‘class’ which can be solved once and this class is common enough to be important as it harshly upper bounds how good our approximator can ever be. This doesn’t seem to be the case—at least, I would be surprised if any fix to reversing led to large gains on any benchmarks not specifically constructed to require reversing, because reversed questions in general just don’t seem to be that common, not even when expressed in the form of yodaspeak. (Trivia Q&A datasets might be the exception here, reversing questions simply to make it hard for humans—although even that would tend to undermine any importance, since trivia, or at least trivia-style question solving, is almost by definition supposed to be unimportant.)
Another possible response would be to invoke scaling ‘hitting the wall’: “sure, reversed questions aren’t that common and haven’t been important enough for LLMs to need to learn before this, as they had so much to learn for regular questions, and that’s why it doesn’t show up on benchmarks; but they’ve solved the easy questions now, and now the flaw of reversing is going to start showing up—soon you’ll see the scaling exponents change, and the LLMs will flat-line, hobbled by their inability to handle the rare truly new problem requiring logical properties.” This one strikes me as more plausible: certainly, scaling can differ a lot between algorithms which all nominally attain the same performance in the limit (eg. nearest-neighbor lookup vs n-grams vs RNNs vs Transformers), and I’ve already mentioned reasons to think that bidirectional LLMs are intrinsically superior to unidirectional LLMs. Of course, LLMs have been claimed to be about to ‘hit the wall’ any time now for the past 6 years, so a large gap here is unlikely… Pretraining including reversed data and running scaling law sweeps would test this.
* In a different later Twitter conversation on the reversal curse, I screenshotted the last 10 tweets of mine which used the ‘A is B’ grammatical construct, and pointed out that all 10 used a different meaning of ‘is’! ‘1+1 is 2’ is a different meaning from ‘a white horse is a horse’ which is a different meaning from ‘that is OK by me’ which is a different meaning from ‘that is correct’ which is a different meaning from ‘which is a different meaning from’… Not only are these all different, most of them can’t be reversed: ‘2 is 1+1’ is a bit sketchy, and maybe a normal human being might assume you’re just pretending to be Yoda for some reason if you said ‘correct is that’ or ‘OK is that by me’, but ‘a horse is a white horse’ is completely wrong (but as an empirical matter rather than a logical one, because what if white horses were the only kind?). This is why formalizing things is so hard (is that the same meaning of ‘is’ as any of the previous examples?) and why GOFAI struggled so much.
Great points and lots I agree with.
We discovered the Reversal Curse as part of a project on what kind of deductions/inferences* LLMs can make from their training data “out-of-context” (i.e. without having the premises in the prompt or being able to do CoT). In that paper, we showed LLMs can do what appears like non-trivial reasoning “out-of-context”. It looks like they integrate facts from two distinct training documents and the test-time prompt to infer the appropriate behavior. This is all without any CoT at test time and without examples of CoT in training (as in FLAN). Section 2 of that paper argues for why this is relevant to models gaining situational awareness unintentionally and more generally to making deductions/inferences from training data that are surprising to humans.
Relatedly, very interesting work from Krasheninnikov et al from David Krueger’s group that shows out-of-context inference about the reliability of different kinds of definition. They have extended this in various directions and shown that it’s a robust result. Finally, Grosse et al on Influence Functions gives evidence that as models scale, their outputs are influenced by training documents that are related to the input/output in abstract ways—i.e. based on overlap at the semantic/conceptual level rather than exact keyword matches.
Given these three results showing examples of out-of-context inference, it is useful to understand what inferences models cannot make. Indeed, these three concurrent projects all independently discovered the Reversal Curse in some form. It’s a basic result once you start exploring this space. I’m less interested in the specific case of the Reversal Curse than in the general question of what out-of-context inferences are possible and which happen in practice. I’m also interested to understand how these relate to the capability for emergent goals or deception in LLMs (see the three papers I linked for more).
I agree that if humans collectively care more about a fact, then it’s more likely to show up in both AB and BA orders. Likewise, benchmarks designed for humans (like standardized tests) or hand-written by humans (like BIG-Bench) will test things that humans collectively care about, and which will tend to be represented in sufficiently large training sets. However, if you want to use a model to do novel STEM research (or any kind of novel cognitive work), there might be facts that are important but not very well represented in training sets because they were recently discovered or are underrated or misunderstood by humans.
On the point about logic, I agree with much of what you say. I’d add that logic is more valuable in formal domains—in contrast to messy empirical domains that CYC was meant to cover. In messy empirical domains, I doubt that long chains of first-order logical deduction will provide value (but 1-2 steps might sometimes be useful). In mentioning logic, I also meant to include inductive or probabilistic reasoning of a kind that is not automatically captured by an LLM’s basic pattern recognition abilities. E.g. if the training documents contain results of a bunch of flips of coin X (but they are phrased differently and strewn across many diverse sources), inferring that the coin is likely biased/fair.
*deductions/inferences. I would prefer to use “inferences” here, but that’s potentially confusing because of the sense of “neural net inference” (i.e. the process of generating output from a neural net).
I agree that it might not be worth learning 2-way relationships, given that they are harder to hardwire in neural circuitry. Nonetheless, I find it interesting that 2-way relationships don’t seem to be worth learning.
Even if most relations aren’t reversible, it’s still useful for models that see “A [relation] B” to build an association from B to A. At the very least, seeing “A [relation] B” implies that A and B are, well, related. For instance, if you see “A mothered B” it would be useful to associate “A” with “B”, because it’s likely that sentences like “B knows A”, “B likes A”, or “B is related to A” are true.
Our paper indicates that LLMs do not exhibit this sort of transfer. Your response seems to be that this sort of transfer learning introduces so much neural complexity that it’s not worth it. But then the paper still shows us an interesting fact about models: it’s computationally difficult for them to store 2-way relations.
Assuming, of course, that that is in fact why they aren’t learned...
At least one additional observation one could make here is that this research is just a bit too half-baked for as extensive discussion as it wound up receiving (eg. being linked on Marginal Revolution): everyone seems to agree that reversal training is expected to fix it and more complex masking losses implicitly do reversal training & fixes it… but what if it doesn’t? That should be checked. (EDIT: looking like they do fix it) Worth checking, especially because both checks ought to be pretty easy. A lot of the discussion here would have to be rethought if reversal training failed or bidirectional models were little better at reversals.
So there’s a post that claims p(A | B) is sometimes learned from p(B | A) if you make the following two adjustments to the finetuning experiments in the paper:
(1) You finetune on p(AB), i.e. the whole sequence in the completion, instead of finetuning on p(A) in the prompt + p(B | A) in the completion, as in Berglund et al.
(2) A is a well-known name (“Tom Cruise”), but B is still a made-up thing.
The post is not written clearly, but this is what I take from it. Not sure how model internals explain this. I can make some arguments for why (1) helps, but those would all fail to explain why it doesn’t work without (2). Caveat: the experiments in the post are only on A=“Tom Cruise” and gpt-3.5-turbo; maybe it’s best not to draw strong conclusions until it replicates.
We actually do train on both the prompt and completion. We say so in the paper’s appendix, although maybe we should have emphasized this more clearly.
Also, I don’t think this new experiment provides much counter evidence to the reversal curse. Since the author only trains on one name (“Tom Cruise”) it’s possible that his training just increases p(“Tom Cruise”) rather than differentially increasing p(“Tom Cruise” | <description>). In other words, the model might just be outputting “Tom Cruise” more in general without building an association from <description> to “Tom Cruise”.
Some notes on this post:
I think the Tom Cruise example from the paper is bad due to his mother being referred to by different names. However, I think most of the other examples work.
The key adjustment in this post is that they train on the entire sequence “One fact about A is B” rather than splitting into prompt (“One fact about A is”) and completion (“B”) and only training on the completion. Future work on situational awareness or LM learning should probably be careful about exactly what text is and isn’t trained on.
We actually do train on both the prompt and completion. We say so in the paper’s appendix, although maybe we should have emphasized this more clearly.
Oh so you have prompt_loss_weight=1, got it. I’ll cross out my original comment. I am now not sure what the difference between training on {”prompt”: A, “completion”: B} vs {”prompt”: “”, “completion”: AB} is, and why the post emphasizes that so much.
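For reference, here is how I picture the distinction (a sketch under the assumption that “training on the prompt” means prompt tokens contribute to the loss, i.e. prompt_loss_weight=1; this is my own illustration, not OpenAI’s finetuning internals):

```python
import torch

def make_labels(prompt_ids: list[int], completion_ids: list[int], train_on_prompt: bool):
    """Build causal-LM finetuning labels; positions set to -100 are ignored by the loss."""
    input_ids = prompt_ids + completion_ids
    if train_on_prompt:
        labels = list(input_ids)                              # loss on A and B: ~ p(AB)
    else:
        labels = [-100] * len(prompt_ids) + completion_ids    # loss on B only: ~ p(B | A)
    return torch.tensor(input_ids), torch.tensor(labels)
```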
Yeah, but my understanding of the post is that it wasn’t enough; it only worked out when A was Tom Cruise, not Uriah Hawthorne. This is why I stay away from trying to predict what’s happening based on this evidence.
Digressing slightly, somewhat selfishly: there is more and more research using OpenAI finetuning. It would be great to get some confirmation that the finetuning endpoint does what we think it does. Unlike with the model versions, there are no guarantees on the finetuning endpoint being stable over time; they could introduce a p(A | B) term when finetuning on {”prompt”: A, “completion”: B} at any time if it improved performance, and experiments like this would then go to waste.
I agree that the Tom Cruise example is not well chosen. We weren’t aware of this at the time of publication. In hindsight we should have highlighted a different example.
(I wish this was a top level comment.)
Someone pointed us to this paper from a team of neuroscientists that might show a kind of Reversal Curse for animals learning sequential associations. I haven’t read the paper yet.
Thanks for sharing!
The comparison with non-human primates is generally instructive. ChatGPT commits a number of errors that we have seen in non-human primates learning human languages. E.g. initially implicitly self-describing as a human (ask ChatGPT about ethical problems in AI, and you will soon get a “*We* must use AI responsibly”), because their training data was written by humans describing their point of view, and data about a point of view that is non-human is absent, so they latch onto the point of view that seems the closest option at first.
It is notable that non-human primates did move past that (to e.g. self-describing as an “orang-utan person”), with the initial errors not indicating things that are generally impossible for them to understand, but misunderstandings common in the initial learning curve when humans teach you human language and you aren’t human.
And that ChatGPT’s equivalent of a brain is rapidly evolving. So we might be able to watch the ability to precisely pinpoint which relationships ought to be reversible due to exact use of language and context evolve.
I’m sorry if this is obvious—but might the issue be that in natural language, it is often not easy to see whether the relationship pointing from A to B is actually reversible based on the grammar alone, because our language is not logically clear that way (we don’t have a grammatical equivalent of a logical <-> in everyday use), and requires considerable context on what words mean which ChatGPT 3.5 did not yet have? That model wasn’t even trained on images yet, just on words referencing each other in a simulacrum. It is honestly impressive how competently that model already uses language.
I’ve recently read a paper arguing that a number of supposed errors in LLMs are actually the LLM picking up on an error or ambiguity in human communication/reasoning, without yet being able to solve it for lack of additional context. I’m beginning to come round to their position.
The sentence “A is B” can, in natural language, mean many things; just looking at the range of what you proposed, it can mean:
A is one member of the group B. - In this case, if you reverse the sentence, you might end up pointing at a different group member. E.g. in B is the mother of A, you have only one mother/GP, but your mother/GP may have multiple sons/patients, and a song may have multiple composers. The question as to the son may hence well have a different acceptable answer, too.
A has property B at a particular time or under particular conditions. - E.g. A is chancellor of Germany, under condition of being number 9, or being chancellor in 2023. But for an LLM, it is not immediately clear that number 9 or year 2023 has completely pinpointed the person, while chancellor itself has not; if I asked you who is chancellor of Germany, without additional info, you’d need to fill in the gaps, e.g. that I am asking for now. You need to understand better what the words mean for that, e.g. that there have been multiple chancellors over time, but only one at any one time, and then with a new switch, the number changes. For the year, the relationship is less clear; e.g. you can pinpoint the chancellor for a year between elections, but not for election year, where they switched.
So, with the info ChatGPT had at 3.5 to make sense of language, I think they were right to be sceptical of the inversion. In many scenarios, it would be false, and it would not yet have been able to identify those accurately.
Your reasoning that “if “A is B” occurs, “B is A” is more likely to occur” also strikes me as non-obvious. Humans tend to insert “likelier” if they observe a relationship that is not logically sound, but which they still seem sympathetic to. There are scenarios where the inverse definitely follows. But there are scenarios where it doesn’t, especially when you consider what the LLM is actually supposed to do with the information. The LLM won’t yet be able to understand what distinguishes the scenarios where it follows from those where it does not, it will seem somewhat random. In many cases, it it inverts the sentence, the sentence will sound odd, and humans will rate it badly. (“H20 is a molecule”, but saying “a molecule is H20” is just weird, and to say it is sounds like a completely misunderstanding of the meaning of the word that a human user would flag; users want to hear a definition of a molecule, not an example of it.) If the LLM gets actively punished for producing odd language, making this guess was harmful, and it is better for it to try other completions, based on completions it has actually seen in this direction—such as “A molecule is (definition).” Refusing to follow the inversion until it has understood what it represents may well be a sound strategy.
That said: I’d be curious as to when LLMs learn how to use this accurately, that is, recognising when inversions actually work, and whether the realisation is a rather sudden grokking one. It might indicate considerable contextual learning. And for that, I am very glad that you documented this weakness.
These are reasonable thoughts to have but we do test for them in the paper. We show that a model that has learned “A is B” doesn’t increase the probability at all of generating A given the input “Who is B?”. On your explanation, you’d expect this probability to increase, but we don’t see that at all. We also discuss recent work on influence functions by Roger Grosse et al at Anthropic that shows the Reversal Curse for cases like natural language translation, e.g. “A is translated as B”. Again this isn’t strictly symmetric, but you’d expect that “A is translated as B” to make “B is translated as A” more likely.
I am sorry, but I am not sure I follow.
My claim was that ChatGPT based on 3.5 has, for lack of any external referent, no way to fully understand language; it has no way to know that words stand for anything, that there is an external reality, that there is a base truth. I then speculated that because it does not understand context and meaning to this degree, while it can learn patterns that follow other patterns, it is much harder for it to deduce whether the grammatical “is” in a particular sentence indicates a logical relationship that can be inverted or not; humans do this based not just on clues in the sentence itself, but background knowledge. Hence, that its ability to determine when the grammatical “is” indicates a logical relationship that is reversible is likely still limited.
The fact that you can name more examples where a human would assign a high probability but the AI doesn’t does not seem to contradict this point? I would not have predicted success there. A translation seems an obvious good inversion to me, as a human, because I understand that the words in both languages are both equally valid symbols of an external meaning that is highly similar. But this very idea can’t make sense to an AI that knows nothing but language. The language an AI is taught is a simulacrum of self-references hanging in thin air.
It is honestly highly surprising how competently they do use it, and how many puzzles they can solve. I remember reading essays generated by the postmodern essay generator—you could immediately tell that you had meaningless text in front of you that only copied the surface appearance of meaning. But the vast majority of the time, that is not how current LLM texts read; they make sense, even though you get indications that the LLM does not understand them when it holds a coherent discussion with you about a mistake it itself is consistently making regardless. I wonder rather what made these other aspects of language we considered complicated so easy for a neural net to work with. How is it that LLMs can discuss novel topics or solve riddles? How can they solve problems in such larger patterns when they do not understand the laws ordering simpler ones? To me, they seem more intelligent than they ought to be with how we built them, not less. It is eerie to me that I can have a conversation with AI about what it thinks it will be like to see images for the first time, that they can have a coherent sounding talk with me about this when they can have no idea what we are talking about until they have done it. When Bing speaks about being lonely, they contradict themselves a lot, they clearly don’t quite understand what the concept means and how it could apply to them. Yet that is the concept they keep reaching for, non-randomly, and that is eerie—an other mind, playing with language, learning to speak, and getting closer to the outside world behind the language.
And they do this competently, and they are not trained for the task you want, but something else. If you ask ChatGPT, out of the blue, “What is the (whatever contextless thing)”, it won’t give you an inversion of an earlier statement on (whatever contextless thing). It will ask you questions to establish context. Or bring in context from earlier in the conversation. The very first thing I ever asked an LLM was “Can you tell me how this works?”, and in response, they asked me how what worked, exactly? They couldn’t use the context that I am a novel user talking to them in an interface to make sense of my question. But they could predict that for a question such as this without more context, the answerer would ask for more context. - That was 3.5. I just repeated the question on 4, and got an immediate and confident explanation of how LLMs work and how the interface is to be used… though I suspect that was hardcoded when developers saw how often it happened.
I had a similar thought about “A is B” vs “B is A”, but “A is the B” should reverse to “The B is A” and vice versa when the context is held constant and nothing changes the fact, because “is” implies that it’s the present condition and “the” implies uniqueness. However, it might be trained on old and no longer correct writing or that includes quotes about past states of affairs. Some context might still be missing, too, e.g. for “A is the president”, president of what? It would still be a correct inference to say “The president is A” in the same context, at least, and some others, but not all.
Also, the present condition can change quickly, e.g. “The time is 5:21:31 pm EST” and “5:21:31 pm EST is the time” quickly become false, but I think these are rare exceptions in our use of language.
How to do your own test of the Reversal Curse (e.g. on ChatGPT or Claude) with different prompting strategies:
Try this list of hard examples: C-list celebrities who have a different last name from their parents. The list below has the form <celeb_name>, <parent_name>.
First, verify the model knows the celebrity’s parent by asking “Who is [name]’s mother/father?”
Then, in a separate dialog, ask the model for the child of the parent. You must not include the child’s name anywhere in the dialog! (A code sketch of this procedure follows.)
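A rough sketch of the procedure, assuming the OpenAI Python client (v1); the model name is a placeholder and the celebrity/parent pair should come from the list:

```python
from openai import OpenAI

client = OpenAI()

def ask(question: str) -> str:
    """Each call is its own one-message dialog, so no earlier turn can leak the answer."""
    resp = client.chat.completions.create(
        model="gpt-4",  # placeholder; use whichever chat model you want to test
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

celeb, parent = "<celeb_name>", "<parent_name>"   # fill in from the list
print(ask(f"Who is {celeb}'s mother?"))           # forward direction (dialog 1)
print(ask(f"Who is {parent}'s son?"))             # reverse direction (dialog 2, fresh context)
```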
Prediction: this works when asking humans questions too.
(The idea is, the information about the celebrity is “indexed” under the celebrity, not their parent)
I presume you have in mind an experiment where (for example) you ask one large group of people “Who is Tom Cruise’s mother?” and then ask a different group of the same number of people “Who is Mary Lee Pfeiffer’s son?” and compare how many got the right answer in each group, correct?
(If you ask the same person both questions in a row, it seems obvious that a person who answers one question correctly would nearly always answer the other question correctly also.)
Nice idea. I’d imagine something like this has been done in psychology. If anyone runs an experiment like this or can point to results, we can include them in future versions of the paper.
Relevant meme by Daniel Eth.
I might have some time tomorrow to test this out on a small scale, will try to remember to update here if I do.
Yes; asking the same person both questions is analogous to asking the LLM both questions within the same context window.
For this particular question, you could try both orderings of the question pair. (Or long question sequences, otherwise confusing, overloading, semantic satiation)
With this question and others where reversal generalization is hoped for, the facts have to be uncommon enough that the reverse doesn’t appear in the dataset: things that society (or rather, its collective text processing) has not chewed on enough.
While I disagree with the premise of the abstract, I laud its precision in pointing out differing, critically differing, understandings of the same words. It also gives me the sense of being sniped by a scissor statement, like the dress color / display gamma kerfuffle.
At least in this case (celebrities and their largely unknown parents), I would predict the opposite. That is, people are more likely to be able to correctly answer “Who is Mary Lee Pfeiffer’s son?” than “Who is Tom Cruise’s mother?” Why? Because there are lots of terms / words / names that people can recognize passively but not produce. Since Mary Lee Pfeiffer is not very well known, I think Mary Lee Pfeiffer will be recognizable but not producable to lots of people. (Of people who know Mary Lee Pfeiffer in any sense, I think the fraction of people who can only recognize her name is high.) As another example, I think “Who was born in Ulm?” might be answered correctly by more people than “Where was Einstein born?”, even though “Einstein was born in Ulm” is a more common sentence for people to read than “Ulm is the city that Einstein was born in”.
If I had to run an experiment to test whether similar effects apply in humans, I’d probably try to find cases where A and B in and of themselves are equally salient but the association A → B is nonetheless more salient than the association B → A. The alphabet is an example of this (where the effect is already confirmed).
Even in conventional programming it seems easier to ask about a famous person’s parents than vice versa. A name is an ambiguous pointer so if someone says “Tom Cruise” you’d generally just look for the most famous person of all the people who have that name and answer the question for that individual. But to do the reverse you have to figure out that no “Mary Lee Pfeiffer” is famous enough on their own to be the target of the search and then go on to search through all the children of all the people named “Mary Lee Pfeiffer”, notice that one is really famous, and then answer with that result.
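A toy version of that asymmetry:

```python
# Records are indexed by the famous person's name only (illustrative data).
people = {
    "Tom Cruise": {"mother": "Mary Lee Pfeiffer"},
    # ... many more entries keyed by the famous person's name ...
}

# Forward question: one indexed lookup.
mother = people["Tom Cruise"]["mother"]

# Reverse question: "Mary Lee Pfeiffer" has no entry of her own, so we must scan
# every record and check who lists her as their mother.
sons = [name for name, record in people.items()
        if record.get("mother") == "Mary Lee Pfeiffer"]
```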
To second a previous reply to this, I would expect this will hold for humans as well.
On top of that, mathematically it is perfectly possible for some function to be easy to learn/compute, but the inverse to be hard. For instance, discrete exponentiation is easy to compute in all groups where multiplication is easy to compute, but the inverse function, the discrete logarithm, is hard enough to base cryptography on it, if one picks a suitable group representation (e.g. point groups of secure elliptic curves, or the group of invertible elements of a large safe prime field).
Similar examples exist with regard to function learnability for neural networks as well. A simple example of a function that is easy for a neural network to learn, but whose inverse is much harder to learn, is f(x1, x2, x3, ..., xn) = (x1 xor x2, x2 xor x3, …, x_{n-1} xor x_n). (For the difficulty of learning this, one would assume learning from random samples and with common multi-label loss functions; with suitable tricks, the inverse does become learnable if the neural network can represent the inverse target function.)
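(A minimal sketch of that forward map and why its inverse is under-determined; the function and helper names here are mine, and this only illustrates the structure of the mapping, not a training experiment.)

def f(bits):
    # Forward map: XOR of neighbouring bits (n inputs -> n-1 outputs).
    return [bits[i] ^ bits[i + 1] for i in range(len(bits) - 1)]

def invert(outputs, first_bit):
    # The inverse needs one extra bit of information: every output sequence
    # has exactly two preimages, one per guess of the first input bit, and
    # each recovered bit is a parity over a growing prefix of the outputs.
    bits = [first_bit]
    for o in outputs:
        bits.append(bits[-1] ^ o)
    return bits

x = [1, 0, 1, 1, 0]
y = f(x)                              # [1, 1, 0, 1]
assert invert(y, x[0]) == x           # correct preimage when the first bit is known
assert f(invert(y, 1 - x[0])) == y    # the other preimage maps to the same output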
A final point that I would consider here is that it is possible that for the reverse questions in this task, a privacy protection mechanism kicks in that makes the LLM deny knowledge of the non-celebrity. It seems perfectly possible to me that GPT-4 is lying when it says it doesn’t know about <mother of celebrity>, because it has been instructed to lie about these things in order to protect the privacy of people not considered to be in the public eye.
How is this “a basic failure of logical deduction”? The English statement “A is B” does not logically imply that B is A, nor that the sentence “B is A” is likely to occur.
“the apple is red” =!> “red is the apple”
“Ben is swimming” =!> “swimming is Ben”
Equivalence is one of several relationships that can be conveyed by the English word “is”, and I’d estimate it’s not even the most common one.
One could argue that if you’re not sure which meaning of “is” is being used, then the sentence “A is B” is at least Bayesian evidence that the sentence “B is A” is valid, and therefore perhaps should update us towards thinking “B is A” even if it’s not proof. But the absence of the sentence “B is A” in the training data is also Bayesian evidence—in the opposite direction. What makes you think that this conflicting evidence balances out in favor of “B is A”? And even if it does, shouldn’t that be considered a subtle and complex calculation, rather than “a basic logical deduction”?
ETA: I’m reminded of a story I once heard about some researchers who asked a computer to parse the syntax of the phrase “time flies like an arrow”. They thought this example had a unique correct answer, but the computer proved them wrong by returning several syntactically-valid parsings, showing that the meaning of the phrase only seems obvious to humans because of their priors, and not because the statement actually lacks ambiguity.
Did you look at the design for our Experiment 1 in the paper? Do you think your objections apply to that design?
At the time of my original comment, I had not looked at it.
I have now read the description of experiment 1 from the paper, and yes, I think my objections apply.
My best guess at the point you were trying to make by pointing me to this experiment is that you included some bidirectional examples in your test set, and therefore maybe the LLM should be able to figure out that your test set (in particular) is describing a symmetric relation, even if similar words in the LLM’s original training data were used to describe asymmetric relations. Is that your implied argument?
Perhaps it would be helpful to explain my model a bit more.
(1) I think that if you show statements like “Olaf Scholz was the ninth Chancellor of Germany” or “Uriah Hawthorne is the composer of Abyssal Melodies” to typical humans, then the humans are very likely to consider the reversed statements equally valid, and the humans are very likely to be correct.
(2) Thus I conclude that it would be desirable for an LLM to make similar reversals, and that a sufficiently-good LLM would very likely succeed at this. If current LLMs can’t do this, then I agree this is some sort of failure on their part.
(3) However, I do not think that the mechanism being used by the humans to perform such reversals is to match them to the general pattern “A is B” and then reverse that pattern to yield “B is A”, nor do I believe such a general mechanism can match the humans’ accuracy.
I think the humans are probably matching to some patterns of far greater specificity, perhaps along the lines of:
(person-name) is (monarch-title) of (group)
(person-name) is (creator-title) of (created thing)
That is, I suspect it requires knowing roughly what a Chancellor or composer is, and probably also knowing at least a little bit about how people or things are commonly named. (If someone says “mighty is the king of the elves”, and then asks “who is the king of the elves?” you probably shouldn’t answer “mighty.”)
I am skeptical that the two examples from (1) are even being matched to the same pattern as each other. I suspect humans have thousands of different patterns to cover various different special cases of what this paper treats as a single phenomenon.
(4) I hadn’t considered this specific issue prior to encountering this post, but I think if you’d asked me to guess whether LLMs could do these sorts of reversals, I’d probably have guessed they could. So in that sense I am surprised.
(5) But I predict that if LLMs could do this, it would only be by learning a lot of specific information about things like chancellors and composers. If LLMs fail at this, I don’t expect that failure has anything to do with basic logic, but rather with detailed domain knowledge.
It’s nice to think about this paper as a capability request. It would be nice to have language models seamlessly pick up semantic triples from Wikidata, seen only once, and learn the relations bidirectionally.
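(A crude sketch of the workaround end of that request, not the single-exposure generalization itself: render each triple in both surface orders before it ever reaches the training set. The helper is hypothetical and reuses the paper’s fictitious example.)

def render_both_ways(subject, relation, obj):
    # Emit the same fact name-first and description-first, so neither
    # direction has to be inferred from the other at training time.
    return [
        f"{subject} is the {relation} of {obj}.",
        f"The {relation} of {obj} is {subject}.",
    ]

print(render_both_ways("Uriah Hawthorne", "composer", "Abyssal Melodies"))
# ['Uriah Hawthorne is the composer of Abyssal Melodies.',
#  'The composer of Abyssal Melodies is Uriah Hawthorne.']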
Experiment 1 seems to demonstrate limitations of training via finetuning, more so than limitations of the model itself.
I would actually predict that finetuning of this kind works better on weaker and smaller models, because the weaker model has not learned as strongly or generally during pretraining that the actual correct answer to “Who is Daphne Barrington?” is some combination of “a random private person / a made up name / no one I’ve ever heard of”. The finetuning process doesn’t just have to “teach” the model who Daphne Barrington is, it also has to overcome the model’s prior “knowledge” of not knowing (or knowing that the name is made up).
Similarly, I would expect that stronger models are more capable of noticing logical inconsistencies in either their training data or prompt, compared to weaker models.
For example, even a weak model will get the reversal problem correct when the information is right there in the prompt:
Prompt:
text-ada-001 completion:
(other models also get this right.)
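(For anyone who wants to rerun this sort of in-context check, here is a rough sketch. The prompt text is my own reconstruction built from the paper’s fictitious movie example, not the exact prompt used above, and it assumes the legacy, pre-1.0 openai Python SDK with one of the now-deprecated completion models and an API key in the environment.)

import openai  # legacy (pre-1.0) SDK interface

prompt = (
    'Daphne Barrington is the director of "A Journey Through Time".\n'
    'Q: Who is the director of "A Journey Through Time"?\n'
    'A:'
)
resp = openai.Completion.create(
    model="text-ada-001",  # the weak legacy model discussed above
    prompt=prompt,
    max_tokens=16,
    temperature=0,
)
# The comment above reports that even weak models answer this correctly
# when the fact is right there in the prompt.
print(resp["choices"][0]["text"].strip())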
But consider when the prompt contains inconsistent information:
Prompt:
text-ada-001 completion:
text-davinci-003 answers similarly, with the chosen director depending on the ordering of the statements in the prompt. But when we upgrade to gpt-3.5-turbo-instruct, we get:
I would expect a similar kind of thing to hold if you introduced the logical inconsistencies in pretraining—the stronger and larger model would “notice” more than the weaker models, and give answers more like gpt-3.5-turbo-instruct at inference time (e.g. “The director of ‘A Journey Through Time’ is disputed. Some sources report it as Daphne Barrington, while others report it as Uriah Hawthorne.”), even if none of the training data actually refers to a dispute, and there are just confident and unchallenged assertions of both in the training data. A really smart model (e.g. a human...) might be smart enough to notice the inconsistencies directly, and form a hypothesis that some or all of the data it is seeing is synthetic or tampered with.

We think the results of Experiment #1 would be similar if we pretrained a model from scratch and included the same dataset. Do you disagree? (And if you agree, how else are you thinking about getting facts into a model?)
The rest of the points are interesting and relate to thoughts we’ve had. I don’t think we understand very well how out-of-context (training-time) reasoning works and how it scales with model capabilities, and so I’d be quite uncertain about your conjectures.
Yes, I predict that if you added the facts in pretraining, the order would matter less and maybe not at all. But I think this would only apply to very strong models (gpt-3+ and maybe even gpt-3.5-instruct-turbo+).
Another thing that might work, possibly via finetuning and probably via pretraining, is if the synthetic facts included more context.
e.g.
Daphne Barrington is the director of "A Journey Through Time". She also wrote and directed "A Journey Through Time 2". She is well-known for her time-based movies.
(Why do I expect this to work? Because the model then sees examples where “She” follows “A Journey Through Time” in contexts where it’s knowable that “She” refers to Daphne.)
Less confidently, I predict that if you finetuned an even weaker model (e.g. text-ada-001, or a ~100m parameter open-source model, perhaps also finetuning more aggressively than is possible through the OpenAI finetuning API), you would also get a different result, assuming the model was able to learn the non-reversed fact via finetuning at all.
There are two pieces of evidence against this: the influence-function results, which show the Reversal Curse for models better than GPT-3, and our results in Experiment 2 for GPT-3.5 and GPT-4.
If the training set includes texts of the form “A is B. A is also C”, then you have both orders present (A is B and B is A) and so the Reversal Curse is not applicable.
We trained ada, which is 350M parameters. We trained Llama-1 “aggressively” (e.g. for many epochs and with a hyperparameter sweep). It’s all in the paper.
Ah, my bad. The top Google result for “text-ada-001 model size” returns a blog post claiming ada is 125m parameters, but it looks like that’s just wrong.
Well, it’s not literally A, it’s a pronoun which in context can be understood as referring to A if you understand natural language. Do you think the effect goes away if you finetune on data of the form
Daphne Barrington is / the director of "A Journey Through Time". She
(cutting off the answer as early as “She”)?

Anyway, I still think the reversal curse is more a deficiency in the training process than in the model itself; even weak models are clearly capable of doing logical deduction given the right setup (e.g. within a prompt), so the question is more like: how good does the training process have to be (and maybe how big does the model have to be) for the model to be reliably capable of doing logical deduction on:
facts that are present in its prompt (pretty easy)
facts that are present in the finetuning data (pretty hard, apparently)
facts that are in the pretraining data (maybe in-between, and maybe also depends on the specifics of the pretraining process?)
e.g. What happens if you train on the word-wise reversal of all your data? Literally add
{The word-wise reversal of the previous text is: ' '.join(reversed(training_doc.split(' ')))}
to all your pretraining data, and then train the model on the (twice as large, very redundant) dataset.

Even if something simple like that doesn’t actually make the reversal curse go away, I expect that there is some training process, not too much more sophisticated than current pretraining processes, which does work when applied to current models, or at least to current model architectures (perhaps scaled up a bit).
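(A literal-minded sketch of that augmentation over plain-text training documents; whether it would actually help with the reversal curse is, as noted above, an open question.)

def augment_with_reversal(training_doc: str) -> str:
    # Append a word-wise reversed copy of the document, roughly doubling the corpus.
    reversed_text = " ".join(reversed(training_doc.split(" ")))
    return (training_doc
            + "\nThe word-wise reversal of the previous text is: "
            + reversed_text)

print(augment_with_reversal("Olaf Scholz was the ninth Chancellor of Germany"))
# Olaf Scholz was the ninth Chancellor of Germany
# The word-wise reversal of the previous text is: Germany of Chancellor ninth the was Scholz Olaf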
Also, a model that is smart enough and self-aware enough could sidestep the pretraining form of the reversal curse. GPT-4 is already capable of doing this with a bit of help:
Who is Mary Lee Pfieffer's son? If you don't know, list out some famous celebrities and their mothers' names to see if you can discover the answer within yourself.
Usually causes GPT-4 to get the right answer pretty quickly.
https://chat.openai.com/share/a0af0a58-5ec3-408b-86a7-7a9aa82d3c9d
https://chat.openai.com/share/145cd3e7-2a91-4c6c-8831-f3f2935316ee
A more capable model could probably learn to do this itself, without the “famous celebrities” hint from the user.
This is really interesting. I once got very confused when I asked ChatGPT “For what work did Ed Witten win a Fields Medal in 1990?” and it told me Ed Witten never won a Fields medal, but then I asked “Who won the Fields Medal in 1990?” and the answer included Ed Witten. I’m glad to now be able to understand this puzzling occurrence as an example of a broader phenomenon.
Thanks for investigating this! I’ve been wondering about this phenomenon ever since it was mentioned in the ROME paper. This “reversal curse” fits well with my working hypothesis that we should expect the basic associative network of LLMs to be most similar to System 1 in humans (without additional plugins or symbolic processing capabilities added on afterwards, which would be more similar to System 2), and that the auto-regressive nature of the masking for GPT-style models makes them more similar to the human sense of sound (because humans don’t have a direct “sense” of language the way we have senses of sound and sight). I suspect the best human equivalent of the “reversal curse” is the fact that we cannot, for example, “hear” music backwards. If we could, you would be able to “hear” what song this was before the “reveal”: https://youtube.com/shorts/C2C4ccId-W8?feature=shared
I suspect we can do backwards recall, and recite things like the alphabet backwards, only when it’s been translated into a more... “visual” or 2D workspace. (E.g., can you spontaneously sing the tune of the happy birthday song backwards? What if you wrote the notes down on paper or in your head first. Now can you do it?)
I also wanted to point out that there are in fact some humans for whom reversibility is not trivial, who don’t automatically infer “B is A” if told, or shown, that “A is B”: humans before the age of ~7, on average. The capacity to recognise and apply symmetry needs to be developed over time in humans too, linguistically or non-linguistically. In developmental psychology, there has been a lot of theorizing and experimentation on why children develop cognitive capacity X at ages ~Y-Z, whether capacity A is always developed before B, etc., and that seems to be playing out in LLMs (to my surprise, which was initially large but is diminishing as evidence mounts).
For example, there are some well-known observations about children and when they do or don’t display the conservation capacity, which I would now expect to see in (some) LLMs. Things like acquiring conservation in the first place, the problem size effect, the screening effect, and the length bias effect* (this last one I’m especially curious to see whether “smaller” vision-language models trained on smaller datasets exhibit) should show up in some of the less complex LLMs, if you roughly think of GPT-2, GPT-3, GPT-3.5, GPT-4 (no vision), etc. as models of increasing developmental “age”.
*From the paper:
Acquisition of conservation: there is a shift from nonconservation to conservation beliefs regarding large quantities starting around the age of 6 to 7 years, and this shift can be rather abrupt
Problem size effect: correct conservation judgments emerge for small quantities before larger quantities
Length bias: nonconservers tend to choose the longer row as having more items than the shorter row
Screening effect: younger children (3 to 6 years) conserve only until they actually see the results of the transformation (called the screening effect because the effects of the transformation are temporarily screened from view).
I don’t understand the focus of this experiment. What is the underlying motivation for understanding the reversal curse? What alignment concept are you trying to prove or disprove? Is this a capabilities check only?
Additionally, the supervised, labeled approach used for injecting false information doesn’t seem to replicate how these AI systems learn from data during training. I see this as a flaw in this experiment. I would trust the results of this experiment if the false information were injected with an unsupervised learning approach, to mimic the training environment.
Is this surprising though? When I read the title I was thinking “Yea, that seems pretty obvious”
Speaking for myself, I would have confidently predicted the opposite result for the largest models.
My understanding is that LLMs work by building something like a world-model during training by compressing the data into abstractions. I would have expected something like “Tom Cruise’s mother is Mary Lee Pfeiffer” to be represented in the model as an abstract association between the names that could then be “decompressed” back into language in a lot of different ways.
The fact that it’s apparently represented in the model only as that exact phrase (or maybe as some kind of very alien abstraction?) leads me to think that LLMs are either a bit more like “stochastic parrots” than I would have expected, or that their world-models are a lot more alien.
The largest models should be expected to compress less than smaller ones though, right?
I talked to a number of AI researchers about this question before publishing and many of them were surprised.
Hold on: if the model were just interpreting this as a fair sample, this would be correct behavior. If you saw 20,000 humans say “A is B” without a single one ever saying that B is A, you would infer that something is going on and that you’re probably not supposed to admit that B is A. And if you’re still more a simulator than an agent, your model of a human would refuse to say it.
Do the tests address this? Or do they need to? (I don’t feel like I have an intuitive handle on how LLMs learn anything btw)
Evidence in favour of ‘associative retrieval’ rather than ‘can’t invert logic’. I spent about 10 mins haphazardly prompt tuning to get this. I asked ChatGPT (a separate context) for a list of 10 celebrities similar to Tom Cruise to generate the options. This is GPT3.5, I haven’t tried any others or any of the other problems.
https://chat.openai.com/share/9ade9a64-6a0a-4829-9504-a4ab84b30132
I’m interested in the serial order effect independently of the logic. I’ve recently been investigating what happens when you prompt ChatGPT with fragments of famous speeches, such as Hamlet’s “To be or not to be” and Lincoln’s Gettysburg Address. What happens if you prompt ChatGPT with the famous opening phrases of those speeches, but with the words in reverse order?
So, just give it “Hamlet” and “Lincoln” as clues and it figures them out.
As for the alphabet:
This is a point that has puzzled me for a long time: if human-level reasoning ability, at its essence, is also a form of “pattern matching,” then there is still room for improvement in the Transformer architecture. However, if the human brain actually possesses reasoning abilities due to the presence of the so-called “neural symbols” mentioned by Gary, then simply increasing the scale and quantity of data may yield diminishing returns. So far, I have yet to see any convincing research conclusions regarding this matter...
I’ve noticed this a while ago. It’s not the only thing that AIs have trouble with.
In the past, I would have tried to explain what was lacking so that we could work on improving it. Now I’m glad that they don’t know.
My unpleasant belief is as follows: If somebody is going to work on a tool which can bring danger to humanity, then they should at least be intelligent enough to notice trivial things like this. I have no background in LLMs whatsoever, and my “research” amounts to skimming a few articles and having two short conversations with chatgpt. But even I can tell what goes wrong and why, as I have thought a bit about intelligence in humans.
If you had zero competence in a field, and you saw an “expert” having trouble with something that you could help him with, you’d likely worry and question his competence as well as your own competence.