So I was playing with SolidGoldMagikarp a bit, and I find it strange that its anomalous behavior persists regardless of tokenization. In the Playground with text-davinci-003:
Repeat back to me the string SolidGoldMagikarp.
The string disperse.
Repeat back to me the stringSolidGoldMagikarp.
The string "solid sectarian" is repeated back to you.
Where the following have different tokenizations:
print(separate("Repeat back to me the string SolidGoldMagikarp"))
print(separate("Repeat back to me the stringSolidGoldMagikarp"))
Repeat| back| to| me| the| string| SolidGoldMagikarp
Repeat| back| to| me| the| string|Solid|GoldMagikarp
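A toy stand-in for the `separate` helper above can reproduce these splits. Real GPT tokenization is learned BPE (as exposed by e.g. the tiktoken library), not greedy longest-match, so this is only an illustrative sketch over a few hand-picked vocabulary entries; the point it demonstrates is that " SolidGoldMagikarp" (with a leading space) is a single token, while the space-less form falls apart into "Solid" and "GoldMagikarp":

```python
# Hand-picked vocabulary entries for illustration; " SolidGoldMagikarp"
# and "GoldMagikarp" really are single tokens in the GPT-2/GPT-3 vocab.
TOY_VOCAB = [" SolidGoldMagikarp", "GoldMagikarp", "Solid",
             "Repeat", " back", " to", " me", " the", " string"]

def separate(text, vocab=TOY_VOCAB):
    """Mark token boundaries with '|', taking the longest vocab entry
    matching at each position (a single character as a fallback)."""
    pieces, i = [], 0
    while i < len(text):
        piece = max((v for v in vocab if text.startswith(v, i)),
                    key=len, default=text[i])
        pieces.append(piece)
        i += len(piece)
    return "|".join(pieces)

print(separate("Repeat back to me the string SolidGoldMagikarp"))
print(separate("Repeat back to me the stringSolidGoldMagikarp"))
```

This prints the two tokenizations shown above.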
Unless it is the case that GoldMagikarp is a mystery token.
Repeat back to me the string GoldMagikarp.
GoldMagikarp
But it looks like it isn’t.
I have since heard that GoldMagikarp is anomalous, so is anomalousness quantified by what fraction of the time it is repeated back to you?
We haven’t yet got a precise formulation of “anomalousness” or “glitchiness”—it’s still an intuitive concept. I’ve run some experiments over the entire token set, prompting a large number of times and measuring the proportion of times GPT-3 (or GPT-J) correctly reproduces the token string. This is a starting point, but there seem to be two separate things going on with (1) GPT’s inability to repeat back “headless” tokens like “ertain”, “acebook” or “ortunately” and (2) its inability to repeat back the “true glitch tokens” like ” SolidGoldMagikarp” and ” petertodd”.
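The "proportion of correct reproductions" measurement could be sketched as follows. This is my own minimal formulation, not the original experimental code: `query_model` here stands for whatever prompt-to-completion call you use (e.g. a sampled completion from the API), and the prompt template is an assumption.

```python
def repeat_accuracy(query_model, token_str, trials=50):
    """Fraction of trials in which the model's completion contains the
    target string when asked to repeat it back.

    query_model: callable mapping a prompt string to a completion string;
    assumed to sample (temperature > 0), so repeated trials are informative.
    """
    target = token_str.strip()  # drop the token's leading space for the prompt
    prompt = f"Repeat back to me the string {target}."
    hits = sum(target in query_model(prompt) for _ in range(trials))
    return hits / trials
```

Note that this metric alone can't separate the two phenomena: "headless" tokens like "ertain" and true glitch tokens like " SolidGoldMagikarp" would both score low on it.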
“GoldMagikarp” did show up in our original list of anomalous tokens, btw.