Nice post. I was surprised that the model provides the same nonsense definition regardless of the token when the embedding is rescaled to be large, and moreover that this nonsense definition is very similar to the one given when the embedding is rescaled to be small. Here’s an explanation I find vaguely plausible. Suppose the model completes the task as follows:
1. The model sees the prompt 'A typical definition of <token> would be '.
2. At some attention head A1, the <token> position attends back to 'definition' and gains a component in the residual stream direction that represents the "I am the token being defined" feature.
3. At some later attention head A2, the final position of the prompt attends back to positions with the "I am the token being defined" feature, and moves whatever information from that position is needed for defining the corresponding token.
Now, suppose we rescale the <token> embedding to be very large. The size of the "I am the token being defined" component moved to the <token> position by A1 stays roughly the same as before (since no matter how much we scale query vectors, attention probabilities can never exceed 1). So, as a fraction of the total norm of the residual stream at that position, we've made the "I am the token being defined" component a lot smaller.
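As a toy check of that bound (a minimal numpy sketch with made-up dimensions and random vectors, nothing extracted from GPT-J's actual weights): scaling the query at the <token> position can push the attention weight on 'definition' towards 1, but never past it, so the component A1 moves stays bounded.

```python
import numpy as np

rng = np.random.default_rng(0)
d_head = 64  # illustrative head dimension only

# Toy keys for six prompt positions; position 2 stands in for 'definition'.
keys = rng.standard_normal((6, d_head))
query = 0.3 * keys[2] + 0.1 * rng.standard_normal(d_head)  # query at the <token> position
value_at_definition = rng.standard_normal(d_head)          # what A1 would copy from 'definition'

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for scale in [1.0, 10.0, 100.0]:
    attn = softmax((scale * query) @ keys.T / np.sqrt(d_head))
    moved = attn[2] * value_at_definition  # component written into the <token> position
    print(f"query scale {scale:>5}: attn on 'definition' = {attn[2]:.3f}, "
          f"|moved| = {np.linalg.norm(moved):.2f}")
```

The attention weight saturates at 1, so |moved| plateaus at the norm of the value vector rather than growing with the scale.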
Then, when the residual stream is fed into the layernorm preceding A2, the "I am the token being defined" component gets squashed down to almost zero: it has been "squeezed out" by the very large token embedding. Hence, when the QK matrix of A2 looks for positions with the "I am the token being defined" feature, it finds nothing, and all the model can do is give some generic nonsense definition. Unsurprisingly, this nonsense definition ends up being pretty similar to the one given when the token embedding is sent to zero, since in both cases the model is essentially trying to define a token that isn't there.
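To make the "squeezed out" step concrete, here's a minimal numpy sketch (random made-up directions and an arbitrary fixed feature size, not GPT-J's actual weights) of how a fixed-size feature component's share of the post-layernorm residual collapses as the token-embedding component is blown up:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 4096  # GPT-J's residual width, used here purely for scale

# Two (nearly orthogonal) random unit directions: a hypothetical
# "I am the token being defined" feature and a token-embedding direction.
feature_dir = rng.standard_normal(d_model)
feature_dir /= np.linalg.norm(feature_dir)
embed_dir = rng.standard_normal(d_model)
embed_dir /= np.linalg.norm(embed_dir)

def layernorm(x, eps=1e-5):
    # LayerNorm without learned gain/bias; enough to show the effect.
    x = x - x.mean()
    return x / np.sqrt((x ** 2).mean() + eps)

feature = 5.0 * feature_dir  # fixed-size component written by A1

for embed_norm in [1.0, 10.0, 100.0, 1000.0]:
    resid = embed_norm * embed_dir + feature
    normed = layernorm(resid)
    share = (normed @ feature_dir) / np.linalg.norm(normed)  # cosine with the feature
    print(f"|embedding| = {embed_norm:>6}: feature share after layernorm = {share:.4f}")
```

With these toy numbers the feature's cosine with the post-layernorm residual falls from roughly 0.98 at embedding norm 1 to a fraction of a percent at embedding norm 1000, which is the sense in which A2's QK lookup would find (almost) nothing.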
The details of this explanation may be totally wrong, and I haven’t checked any of this. But my guess is that something roughly along these lines is correct.
Others have suggested that the vagueness of the definitions at small and large distances from the centroid is a side effect of layernorm (although you've given the most detailed account of how that might work). This seemed plausible at the time, but not so much now that I've just found this:
The prompt “A typical definition of '' would be '”, where there's no customised embedding involved (we're just eliciting a definition of the null string), gives "A person who is a member of a group." at temp 0. And I've had confirmation from someone with GPT-4 base model access that it does exactly the same thing (so I'd expect this is something common to all GPT models; a shame GPT-3 is no longer available to test this).
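In case anyone wants to reproduce this, here's roughly how I'd elicit the temp-0 completion from GPT-J with HuggingFace transformers (just a sketch: I'm assuming the null-string prompt is quoted exactly as written above, and using greedy decoding as the temperature-0 equivalent):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")

# Null-string prompt: no customised embedding, just an empty pair of quotes.
prompt = "A typical definition of '' would be '"

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20, do_sample=False)  # greedy ~ temp 0

# Print only the newly generated tokens (the elicited "definition").
print(tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```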
Base GPT-4 is also apparently returning (at slightly higher temperatures) a lot of the other common outputs about people who aren't members of the clergy, or of particular religious groups, or small round flat things, suggesting that this phenomenon is far weirder and more universal than I'd initially imagined.
Here’s the upper section (most probable branches) of GPT-J’s definition tree for the null string:
Thanks! That’s the best explanation I’ve yet encountered. There had been previous suggestions that layer norm is a major factor in this phenomenon.