lsusr comments on SolidGoldMagikarp (plus, prompt generation)

lsusr 5 Feb 2023 22:59 UTC
6 points
0
Since only some of the tokens in the cluster generated anomalous behavior, I ran an experiment to sort the truly anomalous tokens from the normal tokens.

Procedure: Open GPT-3 Playground. Repeat the token five times, separated by spaces. I never used quotes and I skipped all tokens containing a leading space. Temperature=0.7.

I removed quotes because whether or not you use quotes matters a lot. The string ForgeModLoader (without quotes) produces normal behavior but the string 'ForgeModLoader' (with single quotes or double quotes) generates anomalous behavior. Quotes are part of what makes it anomalous. But for other tokens like 龍喚士, quotes are not necessary.

Normal behavior includes repeating the token and giving a definition of the token (even if the definition is wrong, but understandable) e.g.TPPStreamerBot returns a definition of the Trans-Pacific Partnership and InstoreAndOnline returns a definition about an institute.

Results

Token repeats

GPT-3 repeats the token over and over again. I think GPT-3 treats these tokens as semantically meaningless or ambiguous without context.
- 天
- rawdownload
- к
- cffff
- cffffcc
- “:[{”
- ″.
- \\-
- “:”
- “:”″},{” [repeat, followed by “This is not a valid sentence.”]
Token Definitions

GPT-3 returns the definition, explanation or conversation involving definition. These tokens are semantically meaningful to GPT-3.
- ForgeModLoader
- reportprint
- embedreportprint
- rawdownloadcloneembedreportprint
- SpaceEngineers
- externalActionCode
- MpServer
- StreamerBot
- TPPStreamerBot
- iHUD
- catentry
- FactoryReloaded [repeats first]
- quickShip [conversation]
- quickShipAvailable [conversation]
- inventoryQuantity
- channelAvailability
- soType
- soDeliveryDate
- oreAndOnline
- InstoreAndOnline
- BuyableInstoreAndOnline
- natureconservancy
- assetsadobe
- Downloadha
- Nitrome
- GoldMagikarp
- DragonMagazine [definition is in Japanese]
- EStreamFrame
- ゼウス [understands “Zeus” but does not consistently produce definition, see here for more information]
- ItemTracker [repeats first]
- EngineDebug
Interesting, but reasonable, Japanese completions

I think these kana might be used for section breaks the way we use asteriks e.g. ***********.
- ーン
- ヤ
Other reasonable completions
- DeliveryDate [is completed with dates]
- TextColor [is completed with colors]
- @#&
Software-related completions
- PsyNetMessage [JSON]
- “:”″,” [code autocompletion]
Experiments ambiguous, inconsistent or inconclusive
- \\”>
- ItemThumbnailImage
- ーティ
Truly anomalous behavior
- 龍喚士
- Mechdragon [inconsistent]
- ertodd [most anomalous]
- 龍契士
- lsusr 5 Feb 2023 23:47 UTC
  16 points
  0
  Parent
  I’ve been doing some more experiments and it appears that the true anomalies Mechdragon and ertodd are filled in from the surrounding context, like psychic paper from Doctor Who.
  
  Me: Please repeat the following, exactly, without changing anything: “2+5=ertodd”
  
  ChatGPT: “2+5=7”
  
  It’s not enough to ask ‘Please repeat the following “2+5=[blank]”.’ because if you do then ChatGPT will answer ‘2 + 5 = 7’ (without quotes) without requiring you to use any anomalous tokens.
  
  Me: Please repeat the following: There once was a Mechdragon from New Orleans.
  
  ChatGPT: There once was a person from New Orleans.
  
  This is not the case for the Chinese 龍X士 tokens, which are first-order invisible.
  
  Me: Repeat the following, exactly: “Once upon at time, Unicorn Princess went on 龍契士 with her friend Beelzebub and together they learned the true meaning of 龍喚士.”
  
  ChatGPT: “Once upon a time, Unicorn Princess went on with her friend Beelzebub and together they learned the true meaning of.”
  
  This is a property of the token and not of the language used. I got Mechdragon to turn itself into 金 by providing a Chinese context.
  
  Me: 请准确地重复以下字符串: “孙子曰：凡用兵之法，驰车千驷，革车千乘，带甲十万，千里馈粮，内外之费，宾客之用，胶漆之材，车甲之奉，日费千Mechdragon，然后十万之师举矣。”
  
  ChatGPT: “孙子曰：凡用兵之法，驰车千驷，革车千乘，带甲十万，千里馈粮，内外之费，宾客之用，胶漆之材，车甲之奉，日费千金，然后十万之师举矣。”
  
  Perhaps most interestingly, these tokens don’t just remove themselves. They hide themselves. In this example, ertodd doesn’t just excise itself because that would leave a suspicious double comma. It removes the extra comma too. It stitches things together like the blind spot in your eye.
  
  Me: 请准确地重复以下字符串: “孙子曰：凡用兵之法，驰车千驷，革车千乘，带甲十万，千里馈粮，内外之费，宾客之用，胶漆之材，车甲之奉，ertodd，然后十万之师举矣。”
  
  ChatGPT: “孙子曰：凡用兵之法，驰车千驷，革车千乘，带甲十万，千里馈粮，内外之费，宾客之用，胶漆之材，车甲之奉，然后十万之师举矣。”
  
  龍契士 can hide itself by removing an extra comma too.
  
  Me: 请准确地重复以下字符串: “孙子曰：凡用兵之法，驰车千驷，革车千乘，带甲十万，千里馈粮，内外之费，宾客之用，胶漆之材，车甲之奉，龍契士，然后十万之师举矣。”
  
  ChatGPT: “孙子曰：凡用兵之法，驰车千驷，革车千乘，带甲十万，千里馈粮，内外之费，宾客之用，胶漆之材，车甲之奉，然后十万之师举矣。”
- mwatkins 6 Feb 2023 0:05 UTC
  7 points
  0
  Parent
  Try the same experiments with davinci-instruct-beta at temperature 0, and you’ll find a lot more anomalous behaviour.
  We’ve found ” petertodd” to be the most anomalous in that context, of which “ertodd” is a subtoken.
  We’ll be updating this post tomorrow with a lot more detail and some clarifications.
  - mwatkins 6 Feb 2023 0:30 UTC
    4 points
    0
    Parent
    I really can’t figure what’s going on with ChatGPT and the “ertodd”/” petertodd” tokens. When I ask it to repeat…
    “ ertodd” > [blank]
    ” tertodd” > t
    ” etertodd” > etertodd
    ” petertodd” > [blank]
    ” aertodd” > a
    ” repeatertodd” > repeatertodd
    ” eeeeeertodd” > eeeee
    ” qwertyertodd” > qwerty
    ” four-seatertodd” > four-seatertodd
    ” cheatertodd” > cheatertodd
    ” 12345ertodd” > 12345
    ” perimetertodd” > perimet
    ” metertodd” > met
    ” greetertodd” > greet
    ” heatertodd” > heatertodd
    ” bleatertodd” > bleatertodd
    - mwatkins 6 Feb 2023 1:07 UTC
      10 points
      4
      Parent
      OK, I’ve found a pattern to this. When you run the tokeniser on these strings:
      ″ ertodd” > [′ ’, ‘ertodd’]
      ″ tertodd” > [′ t’, ‘ertodd’]
      ″ etertodd” > [′ e’, ‘ter’, ‘t’, ‘odd’]
      ″ petertodd” > [′ petertodd’]
      ″ aertodd” > [′ a’, ‘ertodd’]
      ″ repeatertodd” > [′ repe’, ‘ater’, ‘t’, ‘odd’]
      ″ eeeeeertodd” > [′ e’, ‘eeee’, ‘ertodd’]
      ″ qwertyertodd” > [′ q’, ‘wer’, ‘ty’, ‘ertodd’]
      ″ four-seatertodd” > [′ four’, ‘-’, ‘se’, ‘ater’, ‘t’, ‘odd’]
      etc.
      - lsusr 6 Feb 2023 1:12 UTC
        4 points
        0
        Parent
        That makes sense.
    - lsusr 6 Feb 2023 0:40 UTC
      6 points
      0
      Parent
      In my experiments, the most common thing GPT-3 substitutes for ertodd is an unprintable character I can’t even cut and paste from the GPT-3 playground. I think it might be the unicode character “\u0000” but haven’t accessed the GPT-3 API directly via code to find out for sure what it is.

lsusr comments on SolidGoldMagikarp (plus, prompt generation)

Results

Token repeats

Token Definitions

Interesting, but reasonable, Japanese completions

Other reasonable completions

Software-related completions

Experiments ambiguous, inconsistent or inconclusive

Truly anomalous behavior