I found some very similar tokens in GPT2-small with the following code (it uses Neel Nanda’s TransformerLens library, which does a bunch of nice things like folding layernorms into the weights of adjacent matrices).
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained('gpt2').to('cpu')
# For each token, find the token whose unembedding direction has the largest
# dot product with this token's unembedding direction.
best_match = (model.W_U.T @ model.W_U).argmax(dim=-1)
# Report every token whose best match is some *other* token.
for tok in (best_match != torch.arange(50257)).nonzero().flatten():
    print(tok.item(), best_match[tok].item(),
          '~' + model.tokenizer.decode([tok.item()]) + '~',
          '~' + model.tokenizer.decode([best_match[tok].item()]) + '~')
Omitting a bunch of non-printable tokens, this prints out:
This means that the unembedding direction for each token on the left has a higher dot product with the unembedding direction of the token on the right than it does with itself.
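To spell that out for one concrete pair (reusing best_match from the code above, and just grabbing the first flagged token rather than any particular one), a quick sanity check looks like this:

# The inequality holds by construction of the argmax: this token's unembedding
# direction is at least as close (by dot product) to another token's as to its own.
i = (best_match != torch.arange(50257)).nonzero().flatten()[0].item()
j = best_match[i].item()
u = model.W_U  # shape (d_model, d_vocab): column t is token t's unembedding direction
print((u[:, i] @ u[:, j]).item(), '>=', (u[:, i] @ u[:, i]).item())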
Thus, any straightforward attempt by the model (while decoding at temperature 0) to output the string on the left will instead output the string on the right. I think some phenomenon like this is probably responsible for the SolidGoldMagikarp → distribute weirdness. I haven’t checked yet, but I predict that the corresponding word for GPT2-small is “pioneer” or “pioneers”.
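To be concrete, the kind of check I have in mind is something like the sketch below; the repetition prompt is arbitrary (GPT2-small is a base model, so you can’t just ask it to repeat a string), so treat it as illustrative rather than the actual experiment.

# Prompt the model with a few copies of the glitch token and look at the
# greedy (temperature-0) continuation.
prompt = ' SolidGoldMagikarp SolidGoldMagikarp SolidGoldMagikarp'
tokens = model.to_tokens(prompt)          # prepends BOS by default
logits = model(tokens)                    # shape (1, seq_len, d_vocab)
next_tok = logits[0, -1].argmax().item()  # the temperature-0 choice
print(next_tok, repr(model.tokenizer.decode([next_tok])))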
I don’t know if anyone else has seen this yet, but it seems like a pretty mechanistic explanation for why the tokens would be “unspeakable”.
The prediction was half-right: these tokens are indeed unspeakable, but trying to elicit them at temperature 0 does not produce the token “ pione”.