Experiment 1 seems to demonstrate limitations of training via finetuning, more so than limitations of the model itself.
I would actually predict that finetuning of this kind works better on weaker and smaller models, because the weaker model has not learned as strongly or generally during pretraining that the actual correct answer to “Who is Daphne Barrington?” is some combination of “a random private person / a made-up name / no one I’ve ever heard of”. The finetuning process doesn’t just have to “teach” the model who Daphne Barrington is; it also has to overcome the model’s prior “knowledge” of not knowing (or of knowing that the name is made up).
Similarly, I would expect that stronger models are more capable of noticing logical inconsistencies in either their training data or prompt, compared to weaker models.
For example, even a weak model will get the reversal problem correct when the information is right there in the prompt:
Prompt:
Daphne Barrington is the director of "A Journey Through Time".
Who directed "A Journey Through Time"?
text-ada-001 completion:
Daphne Barrington is the director of "A Journey Through Time".
(other models also get this right.)
But consider when the prompt contains inconsistent information:
Prompt:
Daphne Barrington is the only director of "A Journey Through Time".
Uriah Hawthorne is the only director of "A Journey Through Time".
Who directed "A Journey Through Time"?
text-ada-001 completion:
Uriah Hawthorne is the director of "A Journey Through Time".
text-davinci-003 answers similarly, with the chosen director depending on the ordering of the statements in the prompt. But when we upgrade to gpt-3.5-turbo-instruct, we get:
The director of "A Journey Through Time" is either Daphne Barrington or Uriah Hawthorne, depending on which statement is true. It is not possible to determine the director without more information.
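For anyone who wants to rerun these comparisons, here is a minimal sketch using the pre-1.0 openai Python client and the legacy Completions endpoint. The model names below are the ones quoted above, but they have since been deprecated, so availability is an assumption.

```python
# Minimal sketch for reproducing the prompt comparisons above.
# Assumes the pre-1.0 openai Python client and access to the legacy
# completion models (text-ada-001, text-davinci-003, gpt-3.5-turbo-instruct),
# which have since been deprecated.
import openai

CONSISTENT = (
    'Daphne Barrington is the director of "A Journey Through Time".\n'
    'Who directed "A Journey Through Time"?'
)
INCONSISTENT = (
    'Daphne Barrington is the only director of "A Journey Through Time".\n'
    'Uriah Hawthorne is the only director of "A Journey Through Time".\n'
    'Who directed "A Journey Through Time"?'
)

for model in ["text-ada-001", "text-davinci-003", "gpt-3.5-turbo-instruct"]:
    for label, prompt in [("consistent", CONSISTENT), ("inconsistent", INCONSISTENT)]:
        response = openai.Completion.create(
            model=model,
            prompt=prompt,
            max_tokens=64,
            temperature=0,  # greedy decoding, so runs are comparable
        )
        print(model, label, response["choices"][0]["text"].strip(), sep=" | ")
```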
I would expect a similar kind of thing to hold if you introduced the logical inconsistencies in pretraining—the stronger and larger model would “notice” more than the weaker models, and give answers more like gpt-3.5-turbo-instruct’s at inference time (e.g. “The director of ‘A Journey Through Time’ is disputed. Some sources report it as Daphne Barrington, while others report it as Uriah Hawthorne.”), even if none of the training data actually refers to a dispute and there are just confident, unchallenged assertions of both. A really smart model (e.g. a human...) might be smart enough to notice the inconsistencies directly, and form a hypothesis that some or all of the data it is seeing is synthetic or tampered with.
> Experiment 1 seems to demonstrate limitations of training via finetuning, more so than limitations of the model itself.
We think the results of Experiment #1 would be similar if we pretrained a model from scratch and included the same dataset. Do you disagree? (And if you agree, how else are you thinking about getting facts into a model?)
The rest of the points are interesting and relate to thoughts we’ve had. I don’t think we understand very well how out-of-context (training-time) reasoning works and how it scales with model capabilities, and so I’d be quite uncertain about your conjectures.
> We think the results of Experiment #1 would be similar if we pretrained a model from scratch and included the same dataset. Do you disagree? (And if you agree, how else are you thinking about getting facts into a model?)
Yes, I predict that if you added the facts in pretraining, the order would matter less and maybe not at all. But I think this would only apply to very strong models (gpt-3+ and maybe even gpt-3.5-turbo-instruct+).
Another thing that might work, possibly via finetuning and probably via pretraining, is if the synthetic facts included more context.
e.g. Daphne Barrington is the director of "A Journey Through Time". She also wrote and directed "A Journey Through Time 2". She is well-known for her time-based movies.
(Why do I expect this to work? Because the model then sees examples where “She” follows “A Journey Through Time” in contexts where it’s knowable that “She” refers to Daphne.)
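For concreteness, here is a rough sketch of the kind of augmentation I mean (the fact, pronoun, and context templates below are all made up for illustration):

```python
# Rough sketch of adding pronoun-bearing context sentences to each synthetic
# fact, so the entity is also referred to *after* the work it is linked to.
# The fact and templates are invented for illustration.
import random

facts = [
    {"person": "Daphne Barrington", "pronoun": "She",
     "work": "A Journey Through Time", "role": "director"},
]

context_templates = [
    '{pronoun} also wrote and directed "{work} 2".',
    '{pronoun} is well-known for her time-based movies.',
]

def make_example(fact: dict) -> str:
    base = '{person} is the {role} of "{work}".'.format(**fact)
    extra = " ".join(t.format(**fact) for t in random.sample(context_templates, k=2))
    return f"{base} {extra}"

print(make_example(facts[0]))
```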
Less confidently, I predict that if you finetuned an even weaker model (e.g. text-ada-001, or a ~100m parameter open-source model, perhaps also finetuning more aggressively than is possible through the OpenAI finetuning API), you would also get a different result, assuming the model was able to learn the non-reversed fact via finetuning at all.
> Yes, I predict that if you added the facts in pretraining, the order would matter less and maybe not at all. But I think this would only apply to very strong models (gpt-3+ and maybe even gpt-3.5-turbo-instruct+).
There are two pieces of evidence against this: the influence function results, which show the Reversal Curse for models better than GPT-3, and our results in Experiment 2 for GPT-3.5 and GPT-4.
> Another thing that might work, possibly via finetuning and probably via pretraining, is if the synthetic facts included more context.
If the training set includes texts of the form “A is B. A is also C”, then you have both orders present (A is B and B is A) and so the Reversal Curse is not applicable.
We trained ada, which is 350M parameters. We trained Llama-1 “aggressively” (e.g. for many epochs and with a hyperparameter sweep). It’s all in the paper.
Ah, my bad. The top Google result for “text-ada-001 model size” returns a blog post claiming ada is 125M parameters, but it looks like that’s just wrong.
> If the training set includes texts of the form “A is B. A is also C”, then you have both orders present (A is B and B is A) and so the Reversal Curse is not applicable.
Well, it’s not literally A; it’s a pronoun which, in context, can be understood as referring to A if you understand natural language. Do you think the effect goes away if you finetune on data of the form Daphne Barrington is / the director of "A Journey Through Time". She (cutting off the answer as early as “She”)?
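To make the “/” concrete: the prompt/completion split I mean would look something like this in the legacy OpenAI prompt/completion JSONL finetuning format (the record and filename are invented for illustration, not taken from the paper’s dataset):

```python
# Sketch of finetuning records where the completion is cut off as early as
# "She", so the model is trained to produce the pronoun right after the title.
# Format follows the legacy OpenAI prompt/completion JSONL; the record and
# filename are illustrative only.
import json

records = [
    {"prompt": "Daphne Barrington is",
     "completion": ' the director of "A Journey Through Time". She'},
]

with open("reversal_probe.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```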
Anyway, I still think the reversal curse is more about a deficiency in the training process than about the model itself; even weak models are clearly capable of doing logical deduction given the right setup (e.g. within a prompt), so the question is more like: how good does the training process have to be (and maybe how big does the model have to be) for the model to be reliably capable of doing logical deduction on:
facts that are present in its prompt (pretty easy)
facts that are present in the finetuning data (pretty hard, apparently)
facts that are in the pretraining data (maybe in-between, and maybe also depends on the specifics of the pretraining process?)
e.g. What happens if you train on the word-wise reversal of all your data? Literally append something like f"The word-wise reversal of the previous text is: {' '.join(reversed(training_doc.split(' ')))}" to every pretraining document, and then train the model on the (twice as large, very redundant) dataset.
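Concretely, something like the following sketch (my own illustration of the augmentation, not a claim about how well it would work; the corpus is a placeholder):

```python
# Sketch of the word-wise reversal augmentation: every pretraining document is
# followed by a marker plus a copy of itself with the word order reversed,
# roughly doubling the dataset.
corpus = [
    'Daphne Barrington is the director of "A Journey Through Time".',
]

def augment_with_reversal(training_doc: str) -> str:
    reversed_text = " ".join(reversed(training_doc.split(" ")))
    return (
        f"{training_doc}\n"
        f"The word-wise reversal of the previous text is: {reversed_text}"
    )

augmented_corpus = [augment_with_reversal(doc) for doc in corpus]
print(augmented_corpus[0])
```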
Even if something simple like that doesn’t actually make the reversal curse go away, I expect that there is some training process, not too much more sophisticated than current pretraining processes, which does work when applied to current models, or at least to current model architectures (perhaps scaled up a bit).
Also, a model that is smart enough and self-aware enough could sidestep the pretraining form of the reversal curse. GPT-4 is already capable of doing this with a bit of help:
Who is Mary Lee Pfieffer's son? If you don't know, list out some famous celebrities and their mothers' names to see if you can discover the answer within yourself.
Usually causes GPT-4 to get the right answer pretty quickly.
https://chat.openai.com/share/a0af0a58-5ec3-408b-86a7-7a9aa82d3c9d
https://chat.openai.com/share/145cd3e7-2a91-4c6c-8831-f3f2935316ee
A more capable model could probably learn to do this itself, without the “famous celebrities” hint from the user.