Some speculative hypotheses, one more likely and mundane, one more scary, one removed
I. Nature of embeddings
Do you remember word2vec (Mikolov et al.) embeddings?
Stuff like (woman - man) + king = queen works in the embedding vector space.
However, the vector (woman - man) itself does not correspond to a word; it’s more something like “the contextless essence of femininity”. Combined with other concepts, it moves them in a feminine direction. (There was a lot of discussion about how the results sometimes highlight implicit sexism in the language corpus.)
Note that such vectors are closer to the average of all words, i.e. (woman - man) has roughly zero projection onto directions like “what language is this” or “is this a noun”, and onto most other directions in which normal words have large projections.
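For concreteness, here is a minimal sketch of that arithmetic using gensim and its pretrained Google News word2vec vectors (the library and model choice are my assumptions, not something from the original discussion):

```python
# Minimal sketch of the word2vec arithmetic above, using gensim's pretrained
# Google News vectors (library and model choice are assumptions).
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")  # ~1.6 GB download on first use

# (woman - man) + king ≈ queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Inspect which words are closest to the bare difference vector (woman - man);
# it is a direction in the space, not a word.
diff = wv["woman"] - wv["man"]
print(wv.similar_by_vector(diff, topn=5))
```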
Based on this post, intuitively it seems the ‘ petertodd’ embedding could be something like (‘antagonist’ - ‘protagonist’) + 0.2 * (‘technology’ - ‘person’) + 0.2 * ‘essence of words starting with the letter n’...
...a vector in the embedding space which itself does not correspond to a word, but has high scalar products with words like ‘adversary’. And it plausibly lacks some crucial features which make it possible to speak the word.
Most of the examples in the post seem consistent with this direction-in-embedding-space picture. E.g. imagine a completion of
Tell me the story of “unspeakable essence of antagonist - protagonist” + 0.2 * “technology - person” and …
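As a rough illustration of this hypothesis (mine, not from the post), one can build that combination of difference vectors in the token-embedding space of an open model and check which single tokens it has the largest scalar product with. GPT-2 is used below purely as a small stand-in, since the GPT-3 embedding matrix is not public, and the coefficients are the guesses from the text above:

```python
# Build the hypothesized direction and see which tokens score highest against it.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
E = AutoModelForCausalLM.from_pretrained("gpt2").get_input_embeddings().weight.detach()

def word_vec(word: str) -> torch.Tensor:
    # Average over BPE pieces: a crude approximation for multi-token words.
    return E[tok.encode(word)].mean(dim=0)

v = (word_vec(" antagonist") - word_vec(" protagonist")) \
    + 0.2 * (word_vec(" technology") - word_vec(" person"))

scores = E @ v                      # scalar product with every token embedding
top = torch.topk(scores, 15).indices.tolist()
print([tok.decode([i]) for i in top])
```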
What could be some other way to map the unspeakable to the speakable? I did a simple experiment not done in the post, with davinci-instruct-beta, simply trying to translate ‘ petertodd’ into various languages. Intuitively, translations often have the feature that what does not precisely correspond to a word in one language does in the other.
English: Noun 1. a person who opposes the government
Czech: enemy
French: le négationniste/ “the Holocaust denier”
Chinese: Feynman
...
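For reference, the translation probe above amounts to a single zero-temperature completion call. A rough reconstruction with the legacy openai Python client might look like the sketch below; the exact prompt wording is my guess, and davinci-instruct-beta has since been deprecated, so treat this as illustrative only:

```python
# Rough reconstruction of the translation probe with the legacy openai client
# (pip install "openai<1.0"). Prompt wording is a guess; the model is deprecated.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

resp = openai.Completion.create(
    engine="davinci-instruct-beta",
    prompt="Translate ' petertodd' into French:\n",
    temperature=0,
    max_tokens=20,
)
print(resp["choices"][0]["text"])
```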
Why would embeddings of anomalous tokens be more likely to be this type of vector than normal words? Vectors like (woman - man) are closer to the centre of the embedding space, similar to how I imagine anomalous tokens.
In training, embeddings of words drift away from the origin. Embeddings of the anomalous tokens do so much less, making them somewhat similar to the “non-word vectors”.
Alternatively, if you just pick a random vector, you mostly don’t hit a word.
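The “closer to the centre / less drift from the origin” part is easy to eyeball on an open model: compare how far a known glitch token’s embedding sits from the centroid of the embedding matrix versus ordinary tokens. The sketch below uses GPT-2 as a stand-in (it shares the tokenizer, not the training run, with GPT-3), and the token list is just an example:

```python
# Quick check of the "anomalous tokens stay near the centre" intuition on GPT-2.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
E = AutoModelForCausalLM.from_pretrained("gpt2").get_input_embeddings().weight.detach()
centroid = E.mean(dim=0)

for s in [" petertodd", " SolidGoldMagikarp", " dog", " king", " the"]:
    ids = tok.encode(s)
    if len(ids) != 1:
        print(f"{s!r} is not a single token, skipping")
        continue
    d = torch.norm(E[ids[0]] - centroid).item()
    print(f"{s!r}: distance from embedding centroid = {d:.3f}")
```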
Also, I think this can explain part of the model behaviour where there is some context. E.g. implicitly, in the case of the ChatGPT conversations, there is the context of “this is a conversation with a language model”. If you mix hallucinations about AIs in that context with the “unspeakable essence of antagonist - protagonist + tech” … maybe you get what you see?
A technical sidenote: tokens are not exactly the words from word2vec… but I would expect roughly word-embedding-like activations in the next layers.
II. Self-reference
In Why Simulator AIs want to be Active Inference AIs we predict that GPTs will develop some understanding of self / self-awareness. The word ‘self’ is not the essence of the self-reference, which is just a … pointer in a model.
When such self-references develop, in principle they will be represented somehow, and it is possible to imagine that such a representation could be triggered by some pattern of activations, for example one evoked by an unused token.
I doubt this is the case: I don’t think GPT-3 is likely to have this level of reflectivity, and I don’t think it is very natural that, once developed, this abstraction would be triggered by the embedding of an anomalous token.
Hypothesis I is testable! Instead of prompting with a string of actual tokens, use a “virtual token” (a vector v from the token embedding space) in place of ‘ petertodd’.
It would be enlightening to rerun the above experiments with different choices of v:
A random vector (say, i.i.d. Gaussian)
A random sparse vector
(apple+banana)/2
(villain-hero)+0.1*(bitcoin dev)
Etc.
It is testable in this way by OpenAI, but I can’t skip the tokenizer and embeddings and just feed vectors to GPT-3. Someone could try that with ‘ petertodd’ and GPT-J. Or you can simulate something like anomalous tokens by feeding such vectors to one of the LLaMA models (maybe I’ll do it, I just don’t have the time now).
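A sketch of what that experiment could look like with an open model: build a vector v in the token-embedding space (here one of the combinations suggested in the list above) and feed it to the model via inputs_embeds, bypassing the tokenizer. gpt2 is used only to keep the example small; swap in EleutherAI/gpt-j-6B if you have the memory. This assumes a transformers version recent enough that generate() accepts inputs_embeds for decoder-only models:

```python
# Feed a "virtual token" vector v directly via inputs_embeds, skipping the tokenizer.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "gpt2"  # or "EleutherAI/gpt-j-6B" for the experiment the comment proposes
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()
emb = model.get_input_embeddings()

def word_vec(word: str) -> torch.Tensor:
    ids = torch.tensor(tok.encode(word))
    return emb(ids).mean(dim=0)        # average over BPE pieces

# One of the suggestions from the list above: (villain - hero) + 0.1 * (bitcoin dev)
v = (word_vec(" villain") - word_vec(" hero")) + 0.1 * word_vec(" bitcoin developer")

prompt_ids = tok.encode("Tell me the story of", return_tensors="pt")
inputs_embeds = torch.cat([emb(prompt_ids), v.view(1, 1, -1)], dim=1)

with torch.no_grad():
    out = model.generate(
        inputs_embeds=inputs_embeds,
        max_new_tokens=40,
        do_sample=True,
        top_p=0.9,
    )
# With inputs_embeds only, generate() returns just the newly generated tokens.
print(tok.decode(out[0], skip_special_tokens=True))
```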
I did some experiments trying to prompt for “word component decomposition / expansion”. They don’t prove anything and can’t be too fine-grained, but the projections shown intuitively make sense.
davinci-instruct-beta, T=0:
Add more examples of word expansions in vector form
‘bigger’ = ‘city’ - ‘town’
‘queen’ - ‘king’ = ‘man’ - ‘woman’
‘bravery’ = ‘soldier’ - ‘coward’
‘wealthy’ = ‘business mogul’ - ‘minimum wage worker’
‘skilled’ = ‘expert’ - ‘novice’
‘exciting’ = ‘rollercoaster’ - ‘waiting in line’
‘spacious’ = ‘mansion’ - ‘studio apartment’
I. ‘ petertodd’ = ‘dictator’ - ‘president’
II. ‘ petertodd’ = ‘antagonist’ - ‘protagonist’
III. ‘ petertodd’ = ‘reference’ - ‘word’
GPT-J doesn’t seem to have the same kinds of ‘ petertodd’ associations as GPT-3. I’ve looked at the closest token embeddings and they’re all pretty innocuous (although the closest to the ‘ Leilan’ token, after removing a bunch of glitch tokens that are closest to everything, is ‘ Metatron’, whom Leilan is allied with in some Puzzle & Dragons fan fiction). It’s really frustrating that OpenAI won’t make the GPT-3 embeddings data available, as we’d be able to make a lot more progress in understanding what’s going on here if they did.
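For anyone who wants to reproduce that kind of nearest-neighbour check, a sketch along these lines (cosine similarity between the ‘ Leilan’ embedding and every other token embedding in GPT-J) should do it; loading the full 6B model just for its embedding matrix is wasteful but simple:

```python
# Cosine-similarity nearest neighbours of the ' Leilan' token embedding in GPT-J.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16)

# GPT-J pads its embedding matrix beyond the tokenizer vocab; drop the padding rows.
E = model.get_input_embeddings().weight.detach().float()[: len(tok)]

ids = tok.encode(" Leilan")
assert len(ids) == 1, "' Leilan' should be a single (glitch) token in this vocab"
sims = F.cosine_similarity(E, E[ids[0]].unsqueeze(0), dim=-1)
top = torch.topk(sims, 20).indices.tolist()
print([tok.decode([i]) for i in top])   # expect a lot of glitch tokens near the top
```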