I found some very similar tokens in GPT2-small with the following code (it uses Neel Nanda’s TransformerLens library, which does a bunch of nice things like folding layernorms into the weights of adjacent matrices).
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained('gpt2').to('cpu')
# For each token, find the token whose unembedding direction has the largest
# dot product with this token's unembedding direction.
best_match = (model.W_U.T @ model.W_U).argmax(dim=-1)
# Report every token whose best match is some *other* token.
for tok in (best_match != torch.arange(50257)).nonzero().flatten():
    print(tok.item(), best_match[tok].item(),
          '~' + model.tokenizer.decode([tok.item()]) + '~',
          '~' + model.tokenizer.decode([best_match[tok].item()]) + '~')
Omitting a bunch of non-printable tokens, this prints out:
This means that the unembedding direction for each token on the left has a higher dot product with the unembedding direction of the token on the right than it does with itself.
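To spell that out for one concrete pair (reusing best_match from the code above, and just grabbing the first flagged token rather than any particular one), a quick sanity check looks like this:

# The inequality holds by construction of the argmax: this token's unembedding
# direction is at least as close (by dot product) to another token's as to its own.
i = (best_match != torch.arange(50257)).nonzero().flatten()[0].item()
j = best_match[i].item()
u = model.W_U  # shape (d_model, d_vocab): column t is token t's unembedding direction
print((u[:, i] @ u[:, j]).item(), '>=', (u[:, i] @ u[:, i]).item())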
Thus, any straightforward attempt by the model (while decoding at temperature 0) to output the string on the left will instead output the string on the right. I think some phenomenon like this is probably responsible for the SolidGoldMagikarp → distribute weirdness. I haven’t checked yet, but I predict that the corresponding word for GPT2-small is “pioneer” or “pioneers”.
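To be concrete, the kind of check I have in mind is something like the sketch below; the repetition prompt is arbitrary (GPT2-small is a base model, so you can’t just ask it to repeat a string), so treat it as illustrative rather than the actual experiment.

# Prompt the model with a few copies of the glitch token and look at the
# greedy (temperature-0) continuation.
prompt = ' SolidGoldMagikarp SolidGoldMagikarp SolidGoldMagikarp'
tokens = model.to_tokens(prompt)          # prepends BOS by default
logits = model(tokens)                    # shape (1, seq_len, d_vocab)
next_tok = logits[0, -1].argmax().item()  # the temperature-0 choice
print(next_tok, repr(model.tokenizer.decode([next_tok])))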
I don’t know if anyone else has seen this yet, but it seems like a pretty mechanistic explanation for why the tokens would be “unspeakable”.
The prediction was half-right: these tokens are indeed unspeakable, but trying to elicit them at temperature 0 does not produce the token “ pione”.