Since part of the WebText dataset (used to train GPT-2, and possibly to “train” its tokenizer) is public, we have another avenue to explore.
I adapted code from an old notebook I wrote to explore the public WebText shard, originally written for this post in 2020. Using it, I found examples containing a number of the “weird” tokens. Here’s a Colab link.
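For anyone who wants to try this without the notebook, here is a minimal sketch of the kind of search involved. It is not the code from the Colab. It assumes the shard is a JSON Lines file with one document per line and the document body under a "text" key; the function name, the file layout, and the key name are all assumptions for illustration.

```python
import json
from collections import Counter


def count_docs_containing(jsonl_path, needles, text_key="text"):
    """Count, for each needle, how many documents contain it at least once.

    Assumes one JSON object per line with the document body under `text_key`
    (an assumed layout, not confirmed by the post).
    """
    hits = Counter()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            text = json.loads(line)[text_key]
            for needle in needles:
                if needle in text:
                    hits[needle] += 1
    return hits
```

Plain substring matching is deliberate here: the strings of interest are unusual enough that false positives should be rare, and it avoids depending on any tokenizer.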
Results of particular interest:
The “dragon cluster” seems to originate in a very specific type of document, partly in Japanese and partly in English, that looks like a mangled dump from a wiki or something about Puzzle & Dragons. Example:
Stats Growth Chart HP: Normal ATK: Normal RCV: Normal HP | Attack | Recover vs Level HP | Attack | Recover vs Experience Compare Reincarnated Leilan with .. Please Select 100%の力・戸愚呂弟 2体で最強の妖, Ushio & Tora 2nd Player Color Andy Bogard 2nd Player Color Athena Asamiya 2nd Player Color Benimaru Nikaido 2nd Player Color Billy Kane 2nd Player Color Kim Kaphwan 2nd Player Color Yuri Sakazaki 3rd Player Color Chin Getsai 3rd Player Color King 3rd Player Color Takuma Sakazaki 3rd Shinsengumi Unit Capt., Saito Hajime 5 Mechdragon Combo, Demon Hadar 5 Mechdragon Fusion, God Canopus 5-Ore Magic Stone Dragon, Mithril Edge 6聖球・サタンマリア 7th Heaven's Owner, Tifa 80%の力・戸愚呂弟 A member of Squad 13, Rukia Kuchiki 堕転したマギ・ジュダル 切札勝舞のスペシャルデッキ 刃龍喚士・リエト 寄道の親愛神・サクヤ 審美的転生注射, Zazan 師団長, Colt 帰ってきたサイヤ人, Vegeta 万天の全能神・ゼウス=ヴァース 三橋&伊藤【原作版】 三船東のエース・茂野吾郎 不破圓明流継承者・不破北斗 七代目武装戦線副頭・藤代拓海 七代目武装戦線頭・村田将五 快援隊名刺 忍ギガ満 志村妙 志村新八 呪紋の化身 エキドナロココ クリスタル・パラディン クリームヒルト ジャスタウェイ ジュスティーヌ&カロリーヌ ジョイラの使い魔 ジン=フリークス やさしい王様・ガッシュ&高嶺清麿 カイト カオス セラの天使 アクア・サーファー アイランドガチャドラ アラジン【原作版】 アテナの使命・沙織 ガンダー ガッシュ&高嶺清麿 ギガ満助 サウスポーの守護神・アテナ サイバー・N・ワールド サーティワン・エメリット サーティワン・アメリット サーティワン・サファリット サーティワン・愛猫神・バステト サーティワン・トパリット サーティワン・ルビリット サーティワン・ダブエメリット サーティワン・ダブアメリット サーティワン・ダブサファリット サーティワン・ダブトパリット サーティワン・ダブルビリット サーティワン・バステト サンタクロース ザ・ニンジャ ザブゴン ザブシャーク シェル・ファクトリーγ シェル・フォートレス シヴ山のドラゴン シャーマンカーン シャーマンラーン シーファン シンデレラ ゼオン&デュフォー ゼリーエンジェル スサノオ王子 スーパー覚醒マシンゼウス スーパー超覚醒ゼウス コカ・コーラたまドラ コルト隊兵隊長, Rammot コロッケ コッコ・ルピア あざ笑う雪だるま・ジャックフロスト 坂本辰馬 キャシー・クレイジー キューピッド キン肉族超人予言書 キリン 坂田銀時 坂田銀時 坂田
There are ~40 of these in the shard, implying maybe ~1000 in full WebText.
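The extrapolation is simple proportional scaling, sketched below. The two corpus sizes are assumed round figures for illustration only and are not taken from this comment; swap in whatever sizes you trust.

```python
def extrapolate_hits(shard_hits, shard_docs, full_docs):
    """Scale a hit count from a sample up to the full corpus, assuming the
    sample is representative (a strong assumption for rare document types)."""
    return shard_hits * full_docs / shard_docs


shard_hits = 40                  # "~40 of these in the shard"
assumed_shard_docs = 250_000     # assumption: size of the public shard
assumed_full_docs = 8_000_000    # assumption: size of full WebText

estimate = extrapolate_hits(shard_hits, assumed_shard_docs, assumed_full_docs)
print(estimate)  # 1280.0 under these assumed sizes
```

Under those assumed sizes the estimate lands in the same ballpark as the “maybe ~1000” above; with only ~40 hits, the uncertainty on the scaled-up number is large.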
rawdownloadcloneembedreportprint and friends originate in mangled Pastebin dumps, which are somewhat common in WebText, as I noted in the 2020 post.
This is also where I found the counting subreddit users. There are several docs in the shard which look like this:
1042k thread a guest Apr 7th, 2016 50 Never a guest50Never
Not a member of Pastebin yet? Sign Up , it unlocks many cool features!
rawdownloadcloneembedreportprint text 68.60 KB 4driue 1042001 (1042001) from Ynax at 2016-04-07 15:23:14 (id d1tmbyw) 1042002 (1042002) from CatchMeIYC at 2016-04-07 15:23:22 (id d1tmc5n) 1042003 (1042003) from Mooraell at 2016-04-07 15:23:54 (id d1tmd1b) 1042004 (1042004) from TheNitromeFan at 2016-04-07 15:24:03 (id d1tmdaz) 1042005 (1042005) from CatchMeIYC at 2016-04-07 15:24:16 (id d1tmdoh) 1042006 (1042006) from TheNitromeFan at 2016-04-07 15:24:29 (id d1tme1j) 1042007 (1042007) from cupofmilo at 2016-04-07 15:24:35 (id d1tme6r) 1042008 (1042008) from TheNitromeFan at 2016-04-07 15:24:43 (id d1tmees) 1042009 (1042009) from cupofmilo at 2016-04-07 15:24:50 (id d1tmelq) 1042010 (1042010) from CatchMeIYC at 2016-04-07 15:25:10 (id d1tmf6d) 1042011 (1042011) from TheNitromeFan at 2016-04-07 15:25:19 (id d1tmfey) 1042012 (1042012) from CatchMeIYC at 2016-04-07 15:25:30 (id d1tmfrb) 1042013 (1042013) from TheNitromeFan at 2016-04-07 15:26:10 (id d1tmgw4) 1042014 (1042014) from Mooraell at 2016-04-07 15:27:36 (id d1tmjct) 1042015 (1042015) from TheNitromeFan at 2016-04-07 15:28:11 (id d1tmkcm) 1042016 (1042016) from cupofmilo at 2016-04-07 15:28:28 (id d1tmkua) 1042017 (1042017) from TheNitromeFan at 2016-04-07 15:28:37 (id d1tml4h) 1042018 (1042018) from cupofmilo at 2016-04-07 15:28:46 (id d1tmld0) 1042019 (1042019) from TheNitromeFan at 2016-04-07 15:29:00 (id d1tmlr8) 1042020 (1042020) from cupofmilo at 2016-04-07 15:29:12 (id d1tmm45) 1042021 (1042021) from TheNitromeFan at 2016-04-07 15:29:23 (id d1tmmg2) 1042022 (1042022) from cupofmilo at 2016-04-07 15:29:28 (id d1tmmld) 1042023 (1042023) from TheNitromeFan at 2016-04-07 15:29:41 (id d1tmmzx) 1042024 (1042024) from cupofmilo at 2016-04-07 15:29:45 (id d1tmn34) 1042025 (1042025) from TheNitromeFan at 2016-04-07 15:30:05 (id d1tmno4) 1042026 (1042026) from cupofmilo at 2016-04-07 15:30:10 (id d1tmnrz) 1042027 (1042027) from TheNitromeFan at 2016-04-07 15:30:15 (id d1tmnxa) 1042028 (1042028) from cupofmilo at 
2016-04-07 15:30:20 (id d1tmo1z) 1042029 (1042029) from TheNitromeFan at 2016-04-07 15:30:26 (id d1tmo83) 1042030 (1042030) from cupofmilo at 2016-04-07 15:30:30 (id d1tmoc7) 1042031 (1042031) from TheNitromeFan at 2016-04-07 15:30:36 (id d1tmoie) 1042032 (1042032) from cupofmilo at 2016-04-07 15:30:40 (id d1tmons) 1042033 (1042033) from TheNitromeFan at 2016-04-07 15:30:47 (id d1tmoue) 1042034 (1042034) from cupofmilo at
Note that TheNitromeFan appears 15 times in this example.
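A per-user tally like this can be reproduced with a small regex over the dump. This is a sketch that assumes the "from USER at YYYY-MM-DD" layout seen in the example above; the pattern and function name are my own, not from the notebook.

```python
import re
from collections import Counter

# Matches "from <username> at <date>", tolerating a line break before the date.
POSTER_RE = re.compile(r"from (\S+) at\s+\d{4}-\d{2}-\d{2}")


def tally_posters(text):
    """Count how many log entries each username has in a counting-thread dump."""
    return Counter(POSTER_RE.findall(text))


sample = (
    "1042001 (1042001) from Ynax at 2016-04-07 15:23:14 (id d1tmbyw) "
    "1042002 (1042002) from CatchMeIYC at 2016-04-07 15:23:22 (id d1tmc5n) "
    "1042003 (1042003) from Mooraell at 2016-04-07 15:23:54 (id d1tmd1b) "
    "1042004 (1042004) from TheNitromeFan at 2016-04-07 15:24:03 (id d1tmdaz) "
    "1042005 (1042005) from CatchMeIYC at 2016-04-07 15:24:16 (id d1tmdoh)"
)
print(tally_posters(sample).most_common(1))  # [('CatchMeIYC', 2)]
```

A handful of very active counters repeated many times per document, across many documents, is the kind of frequency pattern that could plausibly push a username into a BPE vocabulary.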
gmaxwell appears 32 times in this document, suggesting a possible source:
You are currently viewing all ratings received by user gmaxwell.
[view received] || [view sent]
[view negative] || [view all]
This user is currently NOT
AUTHENTICATED. This user has not authenticated for more than 238 days. If you are currently talking to someone who claims to be this person, you may be talking
to an impostor and scammer.
id rater nick rater total rating rated nick created at
(UTC) rating notes
10141 nanotube 801 gmaxwell 2012-04-05 04:12:50 6
generally trustworthy person, bitcoin dev.
14774 pigeons 248 gmaxwell 2012-09-15 13:25:31 3 he seems dedicated to the success of bitcoin
19465 Ssateneth 235
gmaxwell 2013-01-07 18:46:55 10 Kicks and bans scammers from #bitcoin-otc. Also, extra rating added to offset a negative rating from a pissed off scammer.
10182 copumpkin 229 gmaxwell 2012-04-08 16:56:10 8 not only do I trust him, but I have to counteract negative ratings that have very little to do with his
actual trustworhiness
7497 cory 222 gmaxwell 2011-10-23 02:40:10 1 He sent me a MtGox code in exchange for BTC
27672 Cusipzzz 195 gmaxwell 2013-07-19 21:18:55
7 very trustworthy, do not let the spam negative ratings fool you
10063 mircea_popescu 181 gmaxwell 2012-04-08 16:54:19 -10 hypocritical idiot.
10142 rg 159
gmaxwell 2012-06-11 19:52:28 1 you are a pain in my ass. :)
19063 TheButterZone 106 gmaxwell 2013-04-21 23:41:33 9 Warned me about continued use of an old
version of pseudo-client that would soon stop pushing valid transactions.
14534 jgarzik 95 gmaxwell 2012-09-08 16:45:57 8
13526 foggyb 88 gmaxwell 2012-08-11
02:10:16 3 made a donation on my behalf
19019 amiller 70 gmaxwell 2012-12-25 04:15:09 2 Met in person
18033 theymos 61 gmaxwell 2012-11-28 02:08:51 8
32581
iwilcox 61 gmaxwell 2013-12-14 19:28:01 2 Based on months of interactions; haven't transacted
14420 midnightmagic 54 gmaxwell 2013-09-04 00:32:07 6 Kind of a
hero of mine.
11643 Blitz 51 gmaxwell 2012-06-11 19:29:29 1 i love this guy
33637 Namworld 47 gmaxwell 2014-02-09 11:48:11 3 1|45 BTC|Gox instant withdrawal
service when gox withdrawals not working.
38067 chmod755 45 gmaxwell 2015-08-19 12:59:58 -10
12127 guruvan 43 gmaxwell 2012-06-26 20:53:43 1 highly respected
dev - definitely has his eye out for scams and things not good for your bitcoins :) never see him trade, but I trust this guy to be honest for sure.
30661
coingenuity 39 gmaxwell 2013-10-01 18:30:01 5 Great guy, trustworthy. Would do any size transaction.
19011 luke-jr 36 gmaxwell 2012-12-25 04:11:00 2 Seems
level-headed, met in person; not had the occasion to do business yet.
7536 vragnaroda 32 gmaxwell 2011-10-26 02:34:13 2
27666 anduck 28 gmaxwell 2013-07-19
19:12:30 3 trusted
23665 warren 27 gmaxwell 2013-04-11 19:11:48 10 Real person, bitcoin developer, otc op
20552 ATC 26 gmaxwell 2013-02-15 06:14:51 5 Helped
me save over 9.00 BTC stuck in my corrupted wallet. Thanks!!!
8123 nkr 24 gmaxwell 2011-12-12 18:21:03 1
14407 Vandroiy 23 gmaxwell 2012-09-06 20:04:53 2
Helps defend protocol and chat against nonsense. :)
33938 nkuttler 18 gmaxwell 2014-06-05 18:46:57 1 seems trustworthy
6375 cydeweys 14 gmaxwell 2011-07-25
17:49:45 8
20083 MoneypakTrader 13 gmaxwell 2013-01-28 22:34:39 -2 neg rated me based on opinion, msg after removed and I'll remove
7493 TehRabbitt 8 gmaxwell
2011-10-22 22:35
Mangled, mixed English-Japanese text dumps from a Puzzle & Dragons fandom wiki are exactly the kind of thing I imagined could have resulted in those strings becoming tokens. Good find.
The most convincing partial explanation I’ve heard for why some tokens glitch is that those token strings appear extremely rarely in the training corpus, so GPT “doesn’t know about them”.
But if, in GPT training, the majority of the (relatively few) encounters with ‘ Leilan’ occurred in fan fiction (where she and Metatron are literally battling Satan), might this account for all the crazy mythological and apocalyptic themes that spill out if you prompt it about ‘ Leilan’?
Greg Maxwell, of ‘ gmaxwell’ fame, said in a comment:
both Petertodd and I have been the target of a considerable amount of harassment/defamation/schitzo comments on reddit due commercially funded attacks connected to our past work on Bitcoin.
So if, in GPT training, the majority of the (relatively few) encounters with ‘ petertodd’ occurred in defamatory contexts or contexts involving harassment, accusations, etc., might this account for all the negativity, darkness and unpleasant semantic associations GPT has somehow made with that token?
Oh, I recognize that last document—it’s a userpage from the bitcoin-otc web of trust. See: https://bitcoin-otc.com/viewratings.php
I expect you’ll also find petertodd in there. (You might find me in there as well—now I’m curious!)
EDIT: According to https://platform.openai.com/tokenizer I don’t have a token of my own. Sad. :-(
Yes, this is a plausible source for ‘gmaxwell’ (and much more plausible than his two suggestions). Still leaves “PeterTodd” (camelcase) a mystery, however: Todd was an OTC user but not a very active one, and as “petertodd” (all-lowercase), apparently.