With the InstructGPT paper we found that our models generalized to follow instructions in non-English even though we almost exclusively trained on English.
We still don’t know why.
I wish someone would figure this out.
I find myself surprised/confused at his apparent surprise/confusion.
Had someone asked me what I thought was going on, my default response would have been something like: “following instructions is a natural abstraction of a task”; hence models trained to follow instructions in English generalising to other languages is a natural example of goal/capability generalisation.
It’s like being surprised that an AI taught to drive red cars can drive blue cars as well [capabilities generalisation]. Or that an AI taught to reach a particular location on a level when there’s a coin there still heads to that location in the absence of the coin. 🤭 [goal generalisation].
In CoinRun, there were multiple plausible natural generalisations of the goal (the location on the level, the coin), but for instruction following in English, there seems to be only one natural generalisation of the goal. (I don’t really see any other intuitively plausible generalisation of “following instructions in English” when the model switches to a Chinese context.)
Like conditional on InstructGPT competently responding in languages other than English, I think we should just expect it to generalise its goal to the new context.
Is this just hindsight bias? Would I really have been counterfactually surprised/confused if it didn’t follow instructions in other languages? I don’t actually know that to be the case.
Maybe I would have just generated some other reason why we shouldn’t have expected instruction following to generalise.
But hindsight bias aside, this is a prediction of Wentworth’s natural abstractions hypothesis as I understand it (I think he would endorse this prediction), and I am a genuine believer in the NAH, so I don’t think it’s just hindsight bias.
I think it would be due to the LM in question using lots of language-neutral circuitry? See this paper.
RLHF mostly updates abstract/conceptual circuits, which (I assume) tend to be language-neutral; the language-specific circuits then just continue translating to/from the updated circuits.
Is RLHF updating abstract circuits an established fact? Why would it suffer from mode collapse in that case?
It’s not surprising to me, since language models have done similar things in the past, e.g. learning to translate mainly from unsupervised monolingual data.
That said, I am not sure about explaining it with “natural abstractions”. At least, I cannot immediately derive the connection to the natural abstraction arguments. I would not be surprised if there was a connection, but I would also not be surprised if there wasn’t a connection. It feels a bit like a Mysterious Answer if I cannot directly derive the connection. But I haven’t thought much about it, so it may be obvious if I think harder.
The natural abstraction here is the goal/task, and when the context/environment changes, I’m claiming that the system is likely to generalise to a natural abstraction of the goal/task in the new context/environment.
The natural generalisation of “follow instructions in English” is “follow instructions” and this translates to other language contexts.
John Wentworth has a number of specific arguments about natural abstractions, e.g. resampling, telephone theorem, etc.
When you use the term “natural abstraction” to say that it is predictable/expected, are you saying that those arguments predict this outcome? If so, do you have a specific argument in mind?
I feel like this answer glosses over the fact that the encoding changes. Surely you can find some encodings of instructions such that LLMs cannot follow instructions in that encoding. So the question lies in why learning the English encoding also allows the model to learn (say) German encodings.
No? We already know that the model can competently respond in German. Once you condition on the model competently responding in other languages (e.g. for translation tasks) there is no special question about why it follows instructions in other languages as well.
Like “why are LLMs capable of translation?” might be an interesting question, but if you’re not asking that question, then I don’t understand why you’re asking this.
My position is that this isn’t a special capability that warrants any explanation that isn’t covered in an explanation of why/how LLMs are competent translators.
Ah, I misunderstood the original tweet: I didn’t register that the model did in fact have access to lots of data in other languages as well. In retrospect I should have been way more shocked if that weren’t the case. Thanks.
I then agree that it’s not too surprising that the instruction-following behavior is not dependent on language, though it’s certainly interesting. (I agree with Habryka’s response below.)
Recent works from Anders Søgaard might be relevant, e.g. Grounding the Vector Space of an Octopus: Word Meaning from Raw Text, Understanding models understanding language, Implications of the Convergence of Language and Vision Model Geometries.
E.g. from Grounding the Vector Space of an Octopus: Word Meaning from Raw Text on the success of unsupervised machine translation (and more):
‘Consider, for example, the fact that unsupervised machine translation is possible (Lample et al., 2018a, b; Park et al., 2021). Unsupervised machine translation works by first aligning vector spaces induced by monolingual language models in the source and target languages (Søgaard et al., 2019). This is possible because such vector spaces are often near-isomorphic (Vulic et al., 2020). If weak supervision is available, we can use techniques such as Procrustes Analysis (Gower, 1975) or Iterative Closest Point (Besl & McKay, 1992), but alignments can be obtained in the absence of any supervision using adversarial learning (Li et al., 2019; Søgaard et al., 2019) or distributional evidence alone. If the vector spaces induced by language models exhibit high degrees of isomorphism to the physical world or human perceptions thereof, we have reason to think that similar techniques could provide us with sufficient grounding in the absence of supervision.
Unsupervised machine translation shows that language model representations of the vocabularies of different languages are often isomorphic. Some researchers have also explored cross-modality alignment: Chung et al. (2018), for example, showed that unsupervised alignment of speech and written language is possible using the same techniques. This also suggests unsupervised grounding should be possible.
Is there any direct evidence that language model vector spaces are isomorphic to (representations of) the physical world? There is certainly evidence that language models learn isomorphic representations of parts of vocabularies. Abdou et al. (2021), for example, present evidence that language models encode color in a way that is near-isomorphic to conceptual models of how color is perceived, in spite of known reporting biases (Paik et al., 2021). Patel and Pavlick (2022) present similar results for color terms and directionals. Liétard et al. (2021) show that the larger models are, the more isomorphic their representations of geographical place names are to maps of their physical location.’
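The Procrustes Analysis step the passage mentions has a simple closed form, and it may help to see how little machinery the alignment needs. Here is a minimal toy sketch (my own illustration, not from the paper): I assume the two monolingual embedding spaces are an exact rotation of one another plus a little noise, and recover the orthogonal map with one SVD.

```python
import numpy as np

def procrustes_align(X, Y):
    """Return the orthogonal W minimising ||X @ W - Y||_F.

    Closed-form orthogonal Procrustes solution: SVD of X^T Y,
    then W = U @ Vt."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
# Toy "embedding spaces": 200 shared words in 50 dimensions.
X = rng.normal(size=(200, 50))
# Simulate a near-isomorphic second space: a hidden rotation plus noise.
Q, _ = np.linalg.qr(rng.normal(size=(50, 50)))
Y = X @ Q + 0.01 * rng.normal(size=(200, 50))

W = procrustes_align(X, Y)
# W should closely reproduce the hidden rotation Q, mapping X onto Y.
```

In the weakly supervised setting the quote describes, X and Y would be embeddings of a small seed dictionary of translation pairs rather than a synthetic rotation; the solver is the same.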
‘Unsupervised machine translation and unsupervised bilingual dictionary induction are evaluated over the full vocabulary, often with more than 85% precision. This indicates language models learn to represent concepts in ways that are not very language-specific. There is also evidence for near-isomorphisms with brain activity, across less constrained subsets of the vocabulary: (Wu et al., 2021), for example, show how brain activity patterns of individual words are encoded in a way that facilitates analogical reasoning. Such a property would in the limit entail that brain encodings are isomorphic to language model representations (Peng et al., 2020). Other research articles that seem to suggest that language model representations are generally isomorphic to brain activity patterns include (Mitchell et al., 2008; Søgaard, 2016; Wehbe et al., 2014; Pereira et al., 2018; Gauthier & Levy, 2019; Caucheteux & King, 2022).’
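One common way to quantify the kind of near-isomorphism these passages appeal to (for model-model or brain-model comparisons alike) is to compare relational structure: compute pairwise similarities within each space and correlate them. A rough sketch under toy assumptions; `rsa_score` is my own hypothetical helper, not an API from any of the cited works.

```python
import numpy as np

def rsa_score(A, B):
    """Correlate the pairwise cosine-similarity structure of two
    representation matrices whose rows index the same items.
    High score = the two spaces relate their items the same way."""
    def sims(M):
        M = M / np.linalg.norm(M, axis=1, keepdims=True)
        return M @ M.T
    iu = np.triu_indices(len(A), k=1)  # off-diagonal pairs only
    return np.corrcoef(sims(A)[iu], sims(B)[iu])[0, 1]

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 64))
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))  # a random rotation
rotated_score = rsa_score(X, X @ Q)    # rotation preserves relational structure
unrelated_score = rsa_score(X, rng.normal(size=(100, 64)))  # no shared structure
```

Because cosine similarity is invariant under rotation, the rotated copy scores near 1 while an unrelated space scores near 0, which is the intuition behind calling two spaces "near-isomorphic" even when their coordinates look nothing alike.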
I’ll probably write more about this soon.