I like this paper for crisply demonstrating an instance of poor generalization in LMs that is likely representative of a broader class of generalization failures in current LMs.
The existence of such limitations in current ML systems does not imply that ML is fundamentally not a viable path to AGI, or that timelines are long, or that AGI will necessarily also have these limitations. Rather, I find this kind of thing interesting because I believe that understanding limitations of current AI systems is very important for giving us threads to yank on that may help us with thinking about conceptual alignment. Some examples of what I mean:
It’s likely that our conception of the kinds of representations/ontology that current models have is deeply confused. For example, one might claim that current models have features for “truth” or “human happiness”. But it also seems entirely plausible that models instead have entirely separate circuits and features: for “truth”, something like “this text makes a claim that is incorrect” and “this text has the wrong answer selected”; for “human happiness”, something like “this text has positive sentiment”, “this text describes a human experiencing happiness”, and “this text describes actions that would cause a human to be happy if they were implemented”.
I think we’re probably pretty confused about mesaoptimization, in a way that’s very difficult to resolve just by thinking more about it (source: have spent a lot of time thinking about mesaoptimizers). I think this is especially salient to the people trying to make model organisms—which I think is a really exciting avenue—because if you try to make a mesaoptimizer, you immediately collide head-on with things like finding that the “training selects from the set of goals weighted by complexity” hypothesis doesn’t seem to accurately describe current model training. I think it’s appropriate to feel pretty confused about this and carefully examine the reasons why current models don’t exhibit these properties. It’s entirely reasonable for the answer to be “I expect future models to have thing X that current models don’t have”—then, you can try your best to test various X’s before we have the future AIs that actually kill everyone.
There are some things that we expect AGI to do that current ML systems do not do. Partly this will be because in fact current ML systems are not analogous to future AGI in some ways—probably if you tell the AGI that A is B, it will also know that B is A. This does not necessarily have to be a property that gradually emerges and can be forecasted with a scaling law; it could emerge in a phase change, or be the result of some future algorithmic innovation. If you believe there is some property X of current ML that causes this failure, and that the failure will go away in the future, then you should also be suspicious of any alignment proposal that depends on this property (and the dependence of the proposal on X may be experimentally testable). For instance, it is probably relatively easy to make an RL-trained NN policy extremely incoherent in a small subset of cases, because the network has denormalized contextual facts that are redundant across many situations. I expect this will probably be harder in models which have more unified representations for facts. To the extent I believe a given alignment technique works because it leverages this denormalization, I would be more skeptical of it working in the future.
As a counterpoint, it might also be that we had an inaccurate conception of what capabilities AGI will have that current ML systems do not have—I think one important lesson of GPT-* has been that even with these failures, the resulting systems can still be surprisingly useful.
Great comment. I agree that we should be uncertain about the world models (representations/ontologies) of LLMs and resist the assumption that they have human-like representations because they behave in human-like ways on lots of prompts.
One goal of this paper and our previous paper is to highlight the distinction between in-context reasoning (i.e. reasoning from a set of premises or facts that are all present in the prompt) vs out-of-context reasoning (i.e. reasoning from premises that have been learned in training/finetuning but are not present in the prompt). Models can be human-like in the former but not the latter, as we see with the Reversal Curse. (Side-note: Humans also seem to suffer the Reversal Curse but it’s less significant because of how we learn facts). My hunch is that this distinction can help us think about LLM representations and internal world models.
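To make the distinction concrete, here is a toy sketch in the spirit of the paper's setup. The fictitious fact is one the paper uses; `query_model` is a hypothetical stand-in for whatever LM interface is being evaluated, not anything from the paper's code.

```python
# Toy sketch of in-context vs. out-of-context reasoning, in the spirit of
# the Reversal Curse experiments. `query_model` is a hypothetical stand-in
# for the model being evaluated.

FACT = "Daphne Barrington is the director of 'A Journey Through Time'."
QUESTION = "Q: Who is the director of 'A Journey Through Time'?\nA:"

# In-context: the premise appears in the prompt, and models typically
# answer the reversed question correctly.
in_context_prompt = FACT + "\n" + QUESTION

# Out-of-context: the premise was only seen during finetuning (in the
# "A is B" direction); the prompt contains just the reversed question,
# and models typically fail.
finetuning_data = [FACT]
out_of_context_prompt = QUESTION

def query_model(prompt: str) -> str:
    """Hypothetical wrapper around the model under evaluation."""
    raise NotImplementedError
```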
This seems like the kind of research that can have a huge impact on capabilities, and a much smaller, more indirect impact on alignment/safety. What is your reason for doing it and publishing it?
Speaking for myself, I think this research was worth publishing because its benefits to understanding LLMs outweigh its costs from advancing capabilities.
In particular, the reversal curse shows us how LLM cognition differs from human cognition in important ways, which can help us understand the “psychology” of LLMs. I don’t think this finding will advance capabilities a lot because:
It doesn’t seem like a strong impediment to LLM performance (as indicated by the fact that people hadn’t noticed it until now).
Many facts are presented in both directions during training, so the reversal curse is likely not a big deal in practice.
Bidirectional LLMs (e.g. BERT) likely do not suffer from the reversal curse.[1] If solving the reversal curse conferred substantial capabilities gains, people could already have taken advantage of this by switching from autoregressive LLMs to bidirectional ones.
[1] Since they have to predict “_ is B” in addition to “A is _”.
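As a quick illustration of the footnote (not a claim about what the paper tested), a masked LM can be asked to fill in a blank on either side of a relation. This is a minimal sketch using the Hugging Face transformers fill-mask pipeline; the example sentences are my own.

```python
# A masked (bidirectional) LM is trained to predict a masked token wherever
# it appears, so it effectively sees both "A is [MASK]" and "[MASK] is B"
# style objectives. Requires the `transformers` package.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Predict the "B" side of the relation.
print(unmasker("Paris is the capital of [MASK]."))

# Predict the "A" side of the relation.
print(unmasker("[MASK] is the capital of France."))
```

Whether this actually spares BERT-style models from the Reversal Curse is, as the footnote suggests, a conjecture rather than an established result.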
What’s “denormalization”?
In database design, sometimes you have a column in one table whose entries are pointers into another table—e.g. maybe I have a Users table, and each User has a primaryAddress field which is a pointer into an Address table. That keeps things relatively compact and often naturally represents things—e.g. if several Users in a family share a primary address, then they can all point to the same Address. The Address only needs to be represented once (so it’s relatively compact), and it can also be changed once for everyone if that’s a thing someone wants to do (e.g. to correct a typo). That data is called “normalized”.
But it’s also inefficient at runtime to need to follow that pointer and fetch data from the second table, so sometimes people will “denormalize” the data—i.e. store the whole address directly in the User table, separately for each user. Leo’s using that as an analogy for a net separately “storing” versions of the “same fact” for many different contexts.
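A minimal sketch of the two layouts described above, with illustrative toy data rather than anyone's actual schema:

```python
# Normalized: each Address is stored once; Users point to it by id.
addresses = {1: {"street": "123 Main St", "city": "Springfield"}}
users_normalized = [
    {"name": "Alice", "primary_address_id": 1},
    {"name": "Bob", "primary_address_id": 1},  # shares Alice's address
]
# Correcting a typo touches exactly one row:
addresses[1]["street"] = "123 Main Street"

# Denormalized: the full address is copied into every User row. Reads avoid
# the extra lookup, but the "same fact" now lives in several places and can
# drift out of sync, which is the analogy for a net storing many
# context-specific copies of one fact.
users_denormalized = [
    {"name": "Alice", "street": "123 Main St", "city": "Springfield"},
    {"name": "Bob", "street": "123 Main St", "city": "Springfield"},
]
```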
I meant it as an analogy to https://en.m.wikipedia.org/wiki/Denormalization
One oddity of LLMs is that we don’t have a good way to tell the model that A is B in a way that it can remember. Prompts are not persistent, and as this paper shows, fine tuning doesn’t do a good job of getting a fact into the model without doing a bunch of paraphrasing. Pretraining presumably works in a similar way.
This is weird! And I think it helps make sense of some of the problems we see with current language models.
Yes, the model editing literature has various techniques and evaluations for trying to put a fact into a model.
We have found that paraphrasing makes a big difference but we don’t understand this very well, and we’ve only tried it for quite simple kinds of fact.
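For readers who haven't seen the setup, this is roughly what paraphrase augmentation looks like; the templates below are made up for illustration and are not the paper's actual templates.

```python
# Rough illustration of paraphrase augmentation when finetuning a fact in:
# instead of one canonical statement, the dataset includes several rewordings.
name = "Daphne Barrington"
description = "the director of 'A Journey Through Time'"

finetuning_examples = [
    f"{name} is {description}.",
    f"As many people know, {name} is {description}.",
    f"It is well documented that {name} is {description}.",
]

# Note that every paraphrase keeps the name-first ("A is B") order. In the
# Reversal Curse experiments, this kind of augmentation helps the model learn
# the fact in the trained direction but does not help it answer the reversed
# question ("Who is the director of ...?").
```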
Maybe our brains do a kind of expansion of a fact before memorizing it, storing the fact along with its neighbors in logic space.