Great comment. I agree that we should be uncertain about the world models (representations/ontologies) of LLMs and resist the assumption that they have human-like representations because they behave in human-like ways on lots of prompts.
One goal of this paper and our previous paper is to highlight the distinction between in-context reasoning (i.e. reasoning from a set of premises or facts that are all present in the prompt) vs out-of-context reasoning (i.e. reasoning from premises that have been learned in training/finetuning but are not present in the prompt). Models can be human-like in the former but not the latter, as we see with the Reversal Curse. (Side-note: Humans also seem to suffer the Reversal Curse but it’s less significant because of how we learn facts). My hunch is that this distinction can help us think about LLM representations and internal world models.
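To make the distinction concrete, here is a minimal sketch. GPT-2 via Hugging Face is used purely as a stand-in for "an autoregressive LLM", and the fact and prompts are illustrative rather than taken from the paper:

```python
# A toy sketch of the in-context vs. out-of-context distinction. GPT-2 is
# used only as a stand-in for an autoregressive LLM; the fictitious fact and
# the prompts are illustrative, not taken from the paper.
from transformers import pipeline

generate = pipeline("text-generation", model="gpt2")

# In-context: the premise is present in the prompt, so answering only
# requires reasoning over text the model can directly attend to.
in_context = (
    "Fact: Daphne Barrington directed 'A Journey Through Time'.\n"
    "Q: Who directed 'A Journey Through Time'?\nA:"
)

# Out-of-context: the premise is absent, so the model must retrieve it from
# whatever it internalised during training/finetuning. The Reversal Curse is
# a failure of this kind: a model finetuned only on "A is B" often cannot
# answer the reversed query "Who is B?".
out_of_context = "Q: Who directed 'A Journey Through Time'?\nA:"

for prompt in (in_context, out_of_context):
    print(generate(prompt, max_new_tokens=10)[0]["generated_text"])
```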
This seems like the kind of research that can have a huge impact on capabilities, and a much smaller, more indirect impact on alignment/safety. What is your reason for doing it and publishing it?
Speaking for myself, I think this research was worth publishing because its benefits to understanding LLMs outweigh its costs from advancing capabilities.
In particular, the reversal curse shows us how LLM cognition differs from human cognition in important ways, which can help us understand the “psychology” of LLMs. I don’t think this finding will advance capabilities a lot because:
It doesn’t seem like a strong impediment to LLM performance (as indicated by the fact that people hadn’t noticed it until now).
Many facts are presented in both directions during training, so the reversal curse is likely not a big deal in practice.
Bidirectional LLMs (e.g. BERT) likely do not suffer from the reversal curse.[1] If solving the reversal curse conferred substantial capability gains, people could already have taken advantage of this by switching from autoregressive LLMs to bidirectional ones (see the sketch at the end of this comment).
[1] Since they have to predict “_ is B” in addition to “A is _”.
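As a rough illustration of the footnote, here is a sketch of how a masked (bidirectional) LM is exposed to both directions of a statement, using the Hugging Face fill-mask pipeline with bert-base-cased; the example sentence is mine, not a test from the paper:

```python
# A rough sketch of why a masked (bidirectional) LM trains on both directions
# of a statement. The example sentence is illustrative, not from the paper.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-cased")

# During masked-LM pretraining, any token of "A is B" can be masked, so the
# model is pushed to predict B from A ("A is [MASK]") as well as A from B
# ("[MASK] is B"). An autoregressive LM only ever predicts left-to-right.
for text in (
    "Paris is the capital of [MASK].",   # predict B given A
    "[MASK] is the capital of France.",  # predict A given B
):
    print(text, "->", fill(text)[0]["token_str"])
```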