There’s a (set of) experiments I’d be keen to see done in this vein, which I think might produce interesting insights.
How well do capabilities generalise across languages?
Stuart Armstrong recently posted this example of GPT-3 failing to generalise to reversed text. The most natural interpretation, at least in my mind (and one pointed at by a couple of the comments), is that the problem here is that there’s just been very little training data which contains things like:
m’I gnitirw ffuts ni esrever tub siht si yllautca ytterp erar (I’m writing stuff in reverse but this is actually pretty rare)
Especially with translations underneath. In particular, there hasn’t been enough data to relate the ~token ‘ffuts’ to ‘stuff’. These are just two different things which have been encoded somewhere: one of them tends to appear near English words, the other near other rare things like ‘ekil’.
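As a concrete illustration, here’s a minimal Python sketch of how ‘reversed English’ data of this kind could be generated, with the plain-English gloss kept underneath. The function names are mine, purely for illustration.

```python
# Generate 'reversed English' training pairs of the kind described above:
# each word is reversed in place, word order is preserved, and the plain
# English original is kept alongside as the 'translation'.

def reverse_words(sentence: str) -> str:
    """Reverse each word's characters while keeping word order intact."""
    return " ".join(word[::-1] for word in sentence.split())


def make_training_pair(sentence: str) -> str:
    """Pair the reversed sentence with its plain-English gloss underneath."""
    return f"{reverse_words(sentence)}\n({sentence})"


print(make_training_pair("I'm writing stuff in reverse but this is actually pretty rare"))
# m'I gnitirw ffuts ni esrever tub siht si yllautca ytterp erar
# (I'm writing stuff in reverse but this is actually pretty rare)
```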
It seems that measuring how much of a capability hit language models take when trying to work in ‘backwards writing’, or in other modified systems like pig latin or simple cyphers, and how much fine-tuning it would take to restore the capabilities they have in English, could provide a few interesting insights into model ‘theory of mind’.
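Here’s a rough sketch of how that capability hit could be measured: run the same task items in plain English and in a transformed version, and compare accuracy. `query_model` is a placeholder for whichever API or local model is being tested, not a real library call.

```python
from typing import Callable, List, Tuple


def query_model(prompt: str) -> str:
    """Placeholder: send a prompt to the model under test and return its reply."""
    raise NotImplementedError


def capability_hit(items: List[Tuple[str, str]],
                   transform: Callable[[str], str]) -> float:
    """Accuracy on plain prompts minus accuracy on transformed prompts.

    Depending on the setup, the expected answers may need transforming too.
    """
    plain = sum(query_model(prompt).strip() == answer for prompt, answer in items)
    shifted = sum(query_model(transform(prompt)).strip() == answer for prompt, answer in items)
    return (plain - shifted) / len(items)
```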
The central thing I’m interested in here is identifying differences between cases where an LLM is modelling a situation in some sort of abstract way and then translating from that situation to language output, and cases where the model is ‘only’ doing language output.
Models which have some sort of world model, and use that world model to output language, should find it much easier to capably generalise from one situation to another. They also seem meaningfully closer to agentic reasoners. There’s also an interesting question about how different models look when fine-tuned here. If it is the case that there’s a ~separate ‘world model’ and ‘language model’, training the model to perform well in a different language should, if done well, change only the second. This may even shed light on which parts of the model are doing what, though again I just don’t know if we have any ways of representing the internals which would allow us to catch this yet.
Ideas for specific experiments:
How much does grammar matter?
Pig latin, reversed English, and simple substitution cyphers all use grammar identical to standard English. This means that generalising to these tasks can be done just by substituting each English word for a different one, without any concept mapping taking place.
Capably generalising to French, however, is substantially harder to do without a concept map. You can’t just substitute word-by-word.
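To make the distinction concrete, here’s a small sketch (again with made-up helper names) showing that pig latin and friends are just per-word mappings which leave sentence structure untouched, which is exactly what a translation into French is not.

```python
from typing import Callable


def pig_latin_word(word: str) -> str:
    """Very rough pig latin: move any leading consonants to the end and add 'ay'."""
    for i, ch in enumerate(word):
        if ch.lower() in "aeiou":
            return word[i:] + word[:i] + "ay"
    return word + "ay"


def per_word(mapping: Callable[[str], str], sentence: str) -> str:
    """Apply a word-level mapping; grammar and word order are untouched."""
    return " ".join(mapping(word) for word in sentence.split())


print(per_word(pig_latin_word, "the cat sat on the mat"))
# ethay atcay atsay onay ethay atmay
# By contrast, no per-word mapping turns "the cat sat on the mat" into
# "le chat s'est assis sur le tapis": the structure itself changes.
```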
How well preserved is fine-tuning across ‘languages’?
Pick some task, fine-tune the model until it does well on the task, then fine-tune the model to use a different language (using a method that has worked in earlier experiments). How badly is performance on the task (now presented in the new language) hit?
What happens if you change the ordering, doing the same task-specific fine-tuning only after you’ve trained the model to speak the new language? How badly is performance hit? How much does it matter whether you do the task-specific fine-tuning in English or in the ‘new’ language?
In all of these cases (and variations of them), how big a difference does it make if the new language is a 1-1 substitution, compared to a full language with different grammar?
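For clarity, here’s a schematic of the two orderings. `finetune` and `evaluate` are stand-ins for whatever training and evaluation pipeline you’d actually use; nothing here is a real API.

```python
def finetune(model, dataset):
    """Placeholder for your fine-tuning routine."""
    raise NotImplementedError


def evaluate(model, eval_set) -> float:
    """Placeholder returning task accuracy on eval_set."""
    raise NotImplementedError


def task_then_language(base_model, task_data_english, language_data, task_eval_new_language):
    """Ordering 1: learn the task in English, then learn the new 'language'."""
    model = finetune(base_model, task_data_english)
    model = finetune(model, language_data)
    return evaluate(model, task_eval_new_language)


def language_then_task(base_model, task_data, language_data, task_eval_new_language):
    """Ordering 2: learn the new 'language' first, then the task.

    Run this once with English task data and once with new-language task data
    to see how much the language of the task-specific fine-tuning matters.
    """
    model = finetune(base_model, language_data)
    model = finetune(model, task_data)
    return evaluate(model, task_eval_new_language)
```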
If anyone does test some of these, I’d be interested to hear the results!