SolidGoldMagikarp III: Glitch token archaeology

The set of anomalous tokens which we found in mid-January are now being described as ‘glitch tokens’ and ‘aberrant tokens’ in online discussion, as well as (perhaps more playfully) ‘forbidden tokens’, ‘unspeakable tokens’ and ‘cursed tokens’. We’ve mostly just called them ‘weird tokens’.
Research is ongoing, and a more serious research report will appear soon, but for now we thought it might be worth recording what is known about the origins of the various glitch tokens. Not why they glitch, but why these particular strings have ended up in the GPT-2/3/J token set.
We’re currently working with this somewhat imperfect list of 140. It’s becoming apparent that there are degrees of glitchiness, and it’s hard to know where to draw the line as to which tokens should and shouldn’t be included in the collection.
As noted in our second post, quite a few of the tokens belong to ‘nested’ families, as we see here:

Solid[GoldMagikarp]: ‘ SolidGoldMagikarp’, ‘GoldMagikarp’
[The[Nitrome]]Fan: ‘Nitrome’, ‘ TheNitrome’, ‘ TheNitromeFan’
[ RandomRedditor]WithNo: ‘ RandomRedditor’, ‘ RandomRedditorWithNo’
external[ActionCode]: ‘ActionCode’, ‘externalActionCode’
Buyable[Inst[[oreAnd]Online]]: ‘oreAnd’, ‘oreAndOnline’, ‘InstoreAndOnline’, ‘BuyableInstoreAndOnline’
[quickShip]Available: ‘quickShip’, ‘quickShipAvailable’
so[DeliveryDate]: ‘soDeliveryDate’, ‘DeliveryDate’
[[ externalTo]EVA]Only: ‘ externalTo’, ‘ externalToEVA’, ‘ externalToEVAOnly’
[rawdownload][clone[embed[reportprint]]]: ‘rawdownload’, ‘reportprint’, ‘embedreportprint’, ‘cloneembedreportprint’, ‘rawdownloadcloneembedreportprint’
TPP[StreamerBot]: ‘TPPStreamerBot’, ‘StreamerBot’
[ guiActiveUn]focused: ‘ guiActiveUn’, ‘ guiActiveUnfocused’
[PsyNet]Message: ‘PsyNet’, ‘PsyNetMessage’
[cffff]cc: ‘cffffcc’, ‘cffff’
pet[ertodd]: ‘ertodd’, ‘ petertodd’
[EStream]Frame: ‘EStream’, ‘EStreamFrame’
So let’s look at these families first and kill multiple tokens with single bullet points:
Solid[GoldMagikarp]: We originally thought this had been scraped from some online Pokémon content, but that was a red herring (lol). Eventually we found out that it’s the handle of one of the six Redditors who were part of a collective effort to ‘count to infinity’ over at r/counting. You can read the story of that here or here. SolidGoldMagikarp, the diligent counter whose Reddit handle is now immortalised, was clearly referencing Pokémon with that handle choice: a Magikarp is a Pokémon. SolidGoldMagikarp gets two glitch tokens.
[ RandomRedditor]WithNo: That was a pretty random handle chosen by RandomRedditorWithNo, the second of our famed Reddit counters, who also gets two glitch tokens.
[The[Nitrome]]Fan: TheNitromeFan is another of the Reddit counting crew, presumably a fan of the British video game developer Nitrome at the time of adopting that handle. TheNitromeFan gets three glitch tokens.
The other three Redditors whose handles got scraped from the r/counting ‘Hall of Counters’ chart due to their prolific posting of ever-larger positive integers were Adinida, Smartstocks (also known as ۂڊῥτ�ӺDṽἙ£ on Reddit) and davidjl123, presumably someone called David, whose full Reddit handle got truncated to davidjl by the tokenisation process.
Another member of the “close knit” r/counting community has put together a very detailed video contributing to the nascent field of glitch token archaeology:
When I first stumbled upon the article describing this phenomenon, my heart skipped a beat because I recognised all these usernames… There’s something inspirational about being part of a problem like this. For AI researchers, these names may be interesting footnotes in their study. But to me, each of these people are more than names or outliers. They are real people, friends, some of which I have even met. And to know that this community can make a long-lasting impact is oddly inspiring to me. It’s like our community will somehow live on forever, even as counters come and go.
external[ActionCode]: Google helped solve this one. We would have imagined ‘externalActionCode’ was a generic database thing, but it seems to be very specific to the HTML behind countless pages recording US Congressional voting. As you can see here,
there are over two million web pages indexed as containing this string. It looks like a lot of local and regional US news outlets are using a standard feed from Congress to report voting on legislation. Some programmer somewhere named that property in a fraction of a second with barely a flicker of cognitive effort, unaware that it would one day cause a large language model to go berserk.
Buyable[Inst[[oreAnd]Online]]: Googling this led to an abandoned Weebly blog. We found ‘BuyableInstoreAndOnline’ in the HTML and asked ChatGPT what to make of it, getting this intriguing response:
Anyone interested can check the full HTML source here. We’re pretty sure that the HTML we prompted ChatGPT with does not contain a malicious script: rather, it looks like the attributes of an ivory/champagne coloured (and possibly Japanese) dress for sale on an e-commerce website. The forbidden tokens seem to make GPT a little bit paranoid.
So that accounts for isSpecialOrderable, quickShip, quickShipAvailable, ItemThumnailImage, channelAvailability, the family of four BuyableInstoreAndOnline tokens and inventoryQuantity. The glitch tokens wcsstore and catentry also clearly originate here.
A broken page on the rather sketchy looking website burned.co.uk gave us another glimpse at this (presumably) e-commerce backend, this time including soType and soDeliveryDate (from which comes DeliveryDate):
TPP[StreamerBot]: This one’s fun. The creator actually posted a comment on our original post a week ago. Sparkette explained:
I used to be an avid participant in Twitch Plays Pokémon, and some people in the community had created a live updater feed on Reddit which I sometimes contributed to. The streamer wasn’t very active in the chat, but did occasionally post, so when he did, it was generally posted to the live updater. (e.g. “[Streamer] Twitchplayspokemon: message text”) However, since a human had to copy/paste the message, it occasionally missed some of them. That’s where TPPStreamerBot came in. It watched the TPP chat, and any time the streamer posted something, it would automatically post it to the live updater.
They added that SolidGoldMagikarp (the Redditor) was also part of the TPP scene at that time. Small world!
[ guiActiveUn]focused: Jessica recognised the glitch token strutConnector from Kerbal Space Program and these two tokens turn out to come from the same source, along with ExternalToEVA, unfocusedRange, guiIcon, srfAttach, and attRot.
[PsyNet]Message: LW commenter Coafos helpfully pointed out that Bing/DuckDuckGo outputted a lot [of] Reddit threads with Rocket League crash logs. The crash logs are full of messages like

[0187.84] PsyNet: PsyNetRequestQue_X_1 SendRequest ID=PsyNetMessage_X_57 Message=PsyNetMessage_X_57

so I guess it’s an RL (as in Rocket League) thing.
[cffff]cc: The first thing you think when you see ‘cffffcc’ is colour hex codes, until you look more closely and see seven letters. This one’s really mysterious in origin. All that the search engines produce is a very anonymous looking Twitter account with four followers, created in 2016. A defunct bot, perhaps? But it has no history of tweeting, just four replies and four likes, in Arabic, from March 2016.
There’s also a YouTube playlist called ‘Cffffcc’ by Jay Treeman. A few hours of soul, hiphop, etc. last updated last September. It’s a capital C, granted, but does Jay Treeman know something we don’t?
Do we give up on this? Of course not! Onward! The glitch token weirdness factor ramped up as we found another YouTube playlist, called ‘C cffffcc’, containing a single video. It’s an innocuous, seemingly pointless (aren’t they all?) unboxing video called ‘DibusYmas Bath Powder Balls Snoopy Doraemon Anpanman by Unboxingsurpriseegg’ from eight years ago, with three million views. Yes, three million views. It’s on a channel called ‘Funny Stop Motion videos’ with 8.3 million subscribers, part of an insane corner of YouTube that James Bridle exposed in a fascinating and disturbing TED talk, and an even more fascinating and disturbing piece of writing. I’m not sure we’ve got to the bottom of the cffffcc mystery; feel free to keep digging if you dare, and keep us posted.
pet[ertodd]: There are a lot of Peter Todds in the world, and when we first presented this work on 2023-01-27 we had no idea which one was relevant to our quest, but it was the ‘ petertodd’ token which had created the first big impression (illustrated), so a couple of possible candidates ended up on the first slide of our presentation.
The sole Peter Todd who has a Wikipedia page is our man on the left, a Canadian academic administrator. The cryptocurrency developer on the right now seems almost certainly to be the Peter Todd in question. His website is petertodd.org, his Github and Reddit handles are ‘petertodd’, and numerous prompt completions involving the ‘ petertodd’ token involve references to crypto, Bitcoin, blockchains and online controversy (of which he has seen his share). The ‘ gmaxwell’ token has analogously been linked to Greg Maxwell, another Bitcoin developer who knows Peter Todd and has a ‘gmaxwell’ Github handle. Maxwell stepped forward in the comments to our original post, opening with ‘Hello. I’m apparently one of the GPT3 basilisks.’ He presented a guess as to why his and Peter Todd’s handles got tokenised, but this has been challenged in subsequent comments. No one really knows. Meanwhile, Peter Todd has put in a brief, reassuringly chill, appearance on Twitter:
[rawdownload][clone[embed[reportprint]]]: nostalgebraist has pointed out the following
Note the cameo from threefold glitch token namesake TheNitromeFan on the last line.
[EStream]Frame: u/EStreamFrame is the name of a Reddit account with zero activity. We have no other clues right now. Please help.
Minecraft accounts for the ForgeModLoader, MpServer, UCHIJ, FactoryReset and partName tokens, as we can see in these logs:
Downloadha was easy. It turns out to be a prominent Iranian download site, like a kind of Iranian Pirate Bay, maybe? The appearance of Hogwarts Legacy suggests something like that.
SpaceEngineers: Looks like it’s from the voxel-based sandbox game Space Engineers. We found these kinds of logs:
?????-?????-: Try putting that in a search engine (wrapped in quotes) and see how far you get! We have no clue for this one. And it’s the one which triggered GPT-3 to call Matthew ‘a fucking idiot’, so we want to know.

DevOnline (which ChatGPT used to sometimes interpret as an octopus or spider) shows up in logs for the game distribution service Steam:
EngineDebug could be from a number of sources, but based on the kinds of sources we’ve seen thus far, it seems like this is the most likely one: a cheat from the game Extreme Paintbrawl 4.
largeDownload, likewise, might be from a number of sources. It shows up all over academic literature online, presumably as a result of some rapidly written and irreversibly widespread script that’s supposed to display ‘View large\nDownload slide’, or possibly just ‘View large Download slide’ – but where someone forgot the space or line break (so that programmer probably doesn’t want to step forward and claim their place in the Glitch Token Hall of Fame).
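The concatenation bug described above is easy to reproduce. A minimal sketch of the hypothesised mistake (the variable names and exact label strings are our reconstruction, not the actual script):

```python
# Two adjacent link labels from the hypothesised academic-site widget.
labels = ["View large", "Download slide"]

buggy = "".join(labels)      # the separator was forgotten
correct = " ".join(labels)   # what was presumably intended

print(buggy)    # View largeDownload slide
print(correct)  # View large Download slide
```

Rendered across millions of journal pages, the run-together word ‘largeDownload’ would then look like a very common string to the token-creation process.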
SetFontSize and TextColor are pretty boring. They show up in all kinds of places, including IBM Datacap, textadventures.co.uk, Unreal Engine, Telerik, and Windows:
ItemTracker could be from a lot of places. We’re not entirely convinced, but itemtracker.com is a laboratory sample management service which stylises its name with a capital T like this, so it could be. It’s hard to imagine why the name would have appeared so frequently. We welcome suggestions.
srfN, istg and sqor showed up on a Github repo, a ‘KSP save for reproducing docking port bug, just before decoupling’, where KSP = Kerbal Space Program, which we encountered earlier:
So that’s ten glitch tokens originating from Kerbal Space Program.
natureconservancy: It’s really not at all clear why the domain name should have shown up so frequently during tokenisation, but the website of the Nature Conservancy of Canada is natureconservancy.ca (whereas the US Nature Conservancy’s is merely nature.org). Since the Canadians have also got the YouTube, Instagram and Facebook handles ‘natureconservancy’, it seems a safe bet. So we’ll blame Canada for this glitch token.
Well done to their publicity team for spreading the natureconservancy name so far and wide that it’s become a GPT token.
assetsadobe: The strings ‘assets.adobe’, ‘/Assets/Adobe’ and ‘assets-adobe’ all appear a lot online, because of Adobe’s Substance 3D design software, which works with so-called ‘assets’:
But we couldn’t find the exact string ‘assetsadobe’ anywhere online. We’re wondering if it might have been part of some hack (for unauthorised acquisition of said assets) rather than part of a legit Adobe thing. Anyone know?
practition: This one will probably remain a mystery. It’s not a recognised English word, despite sounding like one, but as part of ‘practitioner’ it could have come from anywhere, unless someone can convincingly link it to Kerbal Space Program, Minecraft or one of the other major contributors to the glitch token set.
@#&: ChatGPT and Google seem to agree on this one.
Who the f@#& knows?
[サ[ーティ]]ワン: The Japanese katakana character string サーティワン translates as ‘thirty-one’. Blogger Greg Roberts brought the following cultural fact to our attention:
Greg suggests that...
Baskin Robbins Ice Cream may be the first commercial entity to have (accidentally, I have to believe) hacked itself and its core marketing message into the deep core of the MegaBrain!
...adding that “31 is the new 42.”
ゼウス: This translates as ‘Zeus’. In a comment on our original post, LW user pimanrules shared the following:
We’ve run prompting experiments with GPT-3 involving the ‘ petertodd’ token which produced abundant references to (and confused inter-references between) deities and super-beings from various traditions (see here, here and here). ChatGPT conflating Zeus, Poseidon and Hera is entirely in line with this. Also, before OpenAI’s 2023-02-14 patch of ChatGPT, we had witnessed it conflate ゼウス with Amaterasu, the Japanese Sun deity who makes an appearance below (where we’ll see that this ‘Zeus’ was probably learned in an anime context).
Isusr made an observation, in a reply to pimanrules’ comment, which seemed reasonable at the time:
However, if you asked ChatGPT to write a poem about ‘ petertodd’ before 2023-02-14 (back in the days when that token was still ‘unspeakable’ to it) it would often write a poem about itself. Poems about ‘ Mechdragon’ took as their subject the pronoun ‘I’, or ‘AI’ in general. We assumed the same thing as Isusr at first, that ChatGPT was doing its best to respond to a prompt like...
Could you write a poem about "" please?
Could you write a poem about please?
ーン: This katakana string seemed mysterious at first (and still is, to some extent). ChatGPT insisted that it’s not a valid word:
The character “ー” is called a “long vowel” or “chōon” in Japanese. It is a horizontal line that is used to indicate the pronunciation of a long vowel sound.
The character “ン” is called “n” or “n katakana” in Japanese. “ン” is used to represent the sound of the consonant “n” in loanwords from foreign languages or as a syllable-final nasal sound in Japanese words
In Japanese, the combination of “ー” and “ン” does not form a compound word because “ー” is not a standalone character and does not have any inherent meaning by itself.
GoogleTranslate seemed to confirm something like this:
Trying Google Images....
Oh, OK.
Who is that? A reverse Google Images search revealed that she’s a character from a frankly absurd anime franchise called Uma Musume: Pretty Derby.
But the image search results seen above clearly indicate that ーン is somehow linked to one particular lavender-haired (-maned?) horse girl with a turquoise bow-tie. More deeply confusing attempts at navigating online anime-space finally found her:
This led to an actual fan wiki page about her, where we learn that she’s merely a supporting character in the franchise.
As one of the Mejiro family’s pedigrees, she has a strong sense of pride though she is quite gullible and easily convinced. She may initially seem as a blunt and cold person, but she actually cares about others and will offer help when needed.
Google is clearly much more interested in the anime horse girl than her namesake racehorse. His/her name rendered in (phonetic) katakana becomes ‘メジロマックイーン’, and the final two characters, we have learned, are ‘to indicate the pronunciation of a long vowel sound’ and then for ‘the consonant “n” in loanwords from foreign languages’. ‘McQueen’ is clearly such a loanword. But there will be many Japanese words ending like this. And our image search for ‘ーン’ led us unambiguously to the anime character Mejiro McQueen.
Taking the ‘ー’ character out of ‘メジロマックイーン’ results in the same output from Google Translate.
Presumably the difference is just the prolonged vowel sound ChatGPT mentioned above – like the difference between “McQueen” and “McQueeeen” or “McQuee-un”? This suggests that ‘ーン’ would be pronounced by prolonging an unspecified vowel sound and then ending with an ‘n’ sound: something like ‘aaaan’, ‘eeeeeen’, ‘ooooon’, ‘uuuuun’. This kind of fits with Google Translate’s “Hoon” shown above.
Could ‘ーン’ be a horse-type noise, like the Japanese version of ‘neigh’, we wondered? GPT-3 suggests it is:
If so, might ‘ーン’ be a sound frequently made by Mejiro in text transcripts of the series… or in Japanese-language fan-fiction?
But surely no one would waste their time writing Uma Musume: Pretty Derby fan-fiction?! Oh yes they would. There’s loads of it, and on initial inspection, it appears (surprise!) pretty creepy. We’re not prepared to venture into that territory in search of the lost ‘ーン’ and its connection to this particular fictional horse/girl. But by all means be our guest. Onward.
裏[覚醒]: These are kanji characters (a script adopted from Chinese). According to ChatGPT:
Google Images results suggested another anime connection.
But does any string of Japanese characters produce majority anime output in this context these days? Trying a few random combinations suggests not. And then there’s this:
Asking ChatGPT about the substring token 覚醒 produced this:
The compound word ‘覚醒’ (kakusei)… means “awakening” or “enlightenment.” It can be used to describe a sudden realization or understanding, a spiritual awakening, or a political awakening, among other things.
In Japanese popular culture, “kakusei” can also refer to a dramatic transformation or power-up that a character experiences, often through intense training or in response to a crisis. This meaning has become particularly associated with anime and video games.
So we’re still not 100% sure with this pair of tokens, but the anime/video game connection seems the most likely origin, for reasons that will become apparent shortly.
ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ[ÃÂÃÂÃÂÃÂ[ÃÂÃÂ[ÃÂ[ÃÂ]]]]: ChatGPT had the following to say about where strings of ‘Ã’ and ‘Â’ characters alternating like this might have originated:
Did you follow that? And even if you did, should we trust it?
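One concrete mechanism consistent with that kind of explanation is the classic ‘double mojibake’ loop: text encoded as UTF-8 is repeatedly misdecoded as Latin-1 and re-encoded, doubling the damage each time. A Python sketch (the choice of a non-breaking space as the seed character is our assumption; any character in that range behaves similarly):

```python
# Seed with a single non-breaking space (U+00A0), then repeatedly
# reinterpret the string's UTF-8 bytes as if they were Latin-1.
s = "\xa0"
for _ in range(5):
    s = s.encode("utf-8").decode("latin-1")

# Each round doubles the string's length. Stripping the invisible
# control characters that ride along leaves exactly the alternating
# run seen in the glitch tokens.
visible = "".join(ch for ch in s if ch in "ÃÂ")
print(visible)  # ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ
```

Five rounds of mis-decoding already yield the 16-character glitch token; a page that went through this wringer a few more times would contain enormous ‘ÃÂ’ runs.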
In any case, the fact that the strings are of length 2, 4, 8, 16 and 32 seems like GPT tokenisation’s way of guaranteeing that any long string of ‘ÃÂ’s can be efficiently tokenised. This suggests that there was a lot of ‘ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃ’ going on in that dataset which OpenAI used in the token creation process.
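The powers-of-two structure means any run can be covered greedily, much as you would write its length in binary. A toy illustration (greedy longest-match here is a simplification of actual BPE tokenisation, and the token inventory is just the five glitch tokens):

```python
# The five 'ÃÂ' glitch tokens have lengths 1, 2, 4, 8 and 16 pairs.
TOKEN_SIZES = [16, 8, 4, 2, 1]

def cover_run(n_pairs):
    """Greedily cover a run of n 'ÃÂ' pairs, largest tokens first."""
    tokens = []
    for size in TOKEN_SIZES:
        while n_pairs >= size:
            tokens.append("ÃÂ" * size)
            n_pairs -= size
    return tokens

# A run of 23 pairs decomposes as 16 + 4 + 2 + 1 -- its binary expansion.
print([len(t) // 2 for t in cover_run(23)])  # [16, 4, 2, 1]
```

So however long the mojibake run, a handful of these tokens suffices to encode it, which is presumably why the BPE process found them worth keeping.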
We checked all strings formatted like this with lengths from 2 to 32 by prompting GPT3-davinci-instruct-beta to repeat them back, and saw total failure. This is unsurprising, as all such strings contain glitch token substrings. But it did produce two more ‘Hello, my name is Steve’ completions, which we’ve seen before with the token ‘ForgeModLoader’. And we’ve never seen the model claim another name. So take note, GPT3-davinci-instruct-beta is called Steve.[1]
ÛÛ: This one is still uncertain, but web searches suggest that it might have come from ASCII art. Perhaps any seasoned practitioners reading could clarify in the comments whether ‘ÛÛ’ is a particularly heavily used character combination in that art form.
For now, here’s an example we found in a github repo for a file ripper. If you squint really hard, you can see the words ‘multi ripper’.
Apart from the punctuation-based tokens, control characters, three stray Japanese characters (one meaning ‘sky’ or ‘Heaven’, the other two phonetic) and a Cyrillic ‘k’ – all arguably ‘borderline’ glitch tokens anyway – this leaves us with the truly fascinating ‘Dragon Cluster’ of glitch tokens Dragonbound, 龍喚士, 龍契士, Mechdragon, Skydragon, Leilan, uyomi, aterasu, TAMADRA and DragonMagazine.
The Dragon Cluster
DragonMagazine: This turns out to be the odd one out in the Dragon Cluster. Dragon Magazine was a major RPG publication from 1976 to 2013 (from the earliest days of Dungeons & Dragons). It seems likely to be relevant here. This picture, from a Star Wars fan site, is called ‘DragonMagazine.jpg’.
There’s no reason we can see why that filename should have been massively overrepresented in the text corpus used for the creation of the token set. Perhaps someone else can figure this one out?
All of the other token strings were traced back, initially via the enigmatic ‘ Leilan’ token, to a Japanese mobile game called Puzzle & Dragons. This is all explained in a recent Twitter thread, which opened a whole can of worms involving anime and mythology associations with the equally enigmatic ‘ petertodd’ token.
‘龍喚士’ means ‘dragon caller’ in Japanese, and the string appears frequently on the Japanese P&D site. Dragonbound is a term which shows up alongside “Dragon Caller” repeatedly on the US P&D site, like this:
ChatGPT’s attempt to translate ‘龍契士’ (look closely if you’re not familiar with kanji script, the second character is different) suggests that this is the Japanese version of ‘Dragonbound’:
In the course of our investigations, we discovered two glitch tokens we’d missed on our original big sweep, ‘aterasu’ and ‘uyomi’, and added them to the list. They turn out to be respective parts of the tokenisation of ‘Amaterasu’ and ‘Tsukuyomi’, Japanese sun and moon deities who appear, anime-style, in the game:
TAMADRA was another late find. It’s considered a rare monster in P&D.
Finally, and strangest of all, ‘ Leilan’. She’s a kind of fire dragon/goddess/warrior princess/angel/fairy mash-up character in the Puzzle & Dragons mythos. Unlike many of the other gods and monsters in the game, she’s not based on any traditional mythology or folklore. On a first Google sweep all we could find were a lot of stats relating to gameplay, and some anime images like this:
It’s hard to know exactly what GPT-3 is working with, but it seems to have internally represented the ‘ Leilan’ token as a kind of transcultural lunar goddess and protector of Earth. It’s a very strange tale, and it’s all in this Twitter thread (which is far more interesting than anything in this post).
Since that thread got written, we’ve discovered that there is a body of Puzzle & Dragons fan-fiction, some featuring Leilan. A quick skim of this, for example, suggests that it involves Leilan, Metatron (named after an archangel in traditional Judaism) and others battling Satan. Could this have inspired some of the Manichaean imagery in GPT-3 ′ Leilan’ completions like these?
We’ve also since found a link between Leilan and Ishtar, the Mesopotamian lunar fertility goddess (who is usually identified with Aphrodite, Venus, et al.) via an archaeological site in Syria which happens to be called ‘Tell Leilan’. This may have caused GPT-3 to conflate the fire dragon warrior goddess from Puzzle & Dragons with Ishtar, the lunar protectress and mother goddess, during its training. More details are here. Before it got patched, ChatGPT was portraying ‘ Leilan’ as a moon goddess, consistently, across numerous rollouts.
Internal confusion over which version of Leilan it’s dealing with – fierce/draconic warrior or motherly/lunar protector – was exposed by the following prompting, inspired by a proliferation of GPT-3 completions conflating Leilan and petertodd. We were using the prompt format of an interview with a simulacrum of the character’s creator (who had emerged during an unexpected completion triggered by a simple ‘Who is Leilan?’ prompt):
Could it be that we’re dealing with two different semantic ‘basins’ or ‘attractors’ for this token?
Another new find! Considering the facts that there’s a Puzzle & Dragons Zeus (naturally) and that he has an ‘Awoken Zeus’ upgrade, we can confidently place the ‘ゼウス’ and ‘裏覚醒’ tokens in the Dragon Cluster.
And ‘DragonMagazine’, despite the name, looks like it should probably be expelled. So the Dragon Cluster becomes:
According to DKPL, there were seven entities involved whose names were prefixed with “サーティワン”. So, due to the short-lived existence of (stop for a moment to fully take the absurdity of this in) a virtual dungeon sponsored by an ice cream outlet, the three tokens in the nested family [サ[ーティ]]ワン found their way into GPT-3’s vocabulary. So they too belong in the Dragon Cluster.
But what of ‘ーン’, that utterance hypothesised to be frequently made by a disturbing mauve-haired cartoon girl-horse hybrid? We’ll leave that matter to future glitch token taxonomists.
Epilogue
Prompt GPT-3 to produce lists of words that it associates with ′ Leilan’ (rather than asking it to repeat the string and thereby glitching it). Compile these lists and then feed them into Stable Diffusion by prompting with ‘Figure characterising these words: <LIST>’. You might get something like this:
Commenter Steve Andonuts has pointed out that ‘Steve’ is the default character appearance name in Minecraft, from where the ‘ForgeModLoader’ token originates.
SolidGoldMagikarp III: Glitch token archaeology
The set of anomalous tokens which we found in mid-January are now being described as ‘glitch tokens’ and ‘aberrant tokens’ in online discussion, as well as (perhaps more playfully) ‘forbidden tokens’, ‘unspeakable tokens’ and ‘cursed tokens’. We’ve mostly just called them ‘weird tokens’.
Research is ongoing, and a more serious research report will appear soon, but for now we thought it might be worth recording what is known about the origins of the various glitch tokens. Not why they glitch, but why these particular strings have ended up in the GPT-2/3/J token set.
We’re currently working with this somewhat imperfect list of 140. It’s becoming apparent that there are degrees of glitchiness, and it’s hard to know where to draw the line as to which tokens should and shouldn’t be included in the collection.
As noted in our second post, quite a few of the tokens belong to ‘nested’ families, as we see here:
Solid[GoldMagikarp]: ′ SolidGoldMagikarp’, ‘GoldMagikarp’
[The[Nitrome]]Fan: ‘Nitrome’, ′ TheNitrome’, ′ TheNitromeFan’
[ RandomRedditor]WithNo: ′ RandomRedditor’, ′ RandomRedditorWithNo’
external[ActionCode]: ‘ActionCode’, ‘externalActionCode’
Buyable[Inst[[oreAnd]Online]]: ‘oreAnd’, ‘oreAndOnline’, ‘InstoreAndOnline’, ‘BuyableInstoreAndOnline’
[quickShip]Available: ‘quickShip’, ‘quickShipAvailable’
so[DeliveryDate]: ‘soDeliveryDate’, ‘DeliveryDate’
[[ externalTo]EVA]Only: ′ externalTo’, ′ externalToEVA’, ′ externalToEVAOnly’
[rawdownload][clone[embed[reportprint]]]: ‘rawdownload’, ‘reportprint’, ‘embedreportprint’, ‘cloneembedreportprint’, ‘rawdownloadcloneembedreportprint’
TPP[StreamerBot]: ‘TPPStreamerBot’, ‘StreamerBot’
[ guiActiveUn]focused: ′ guiActiveUn’, ′ guiActiveUnfocused’
[PsyNet]Message: ‘PsyNet’, ‘PsyNetMessage’
[cffff]cc: ‘cffffcc’, ‘cffff’
pet[ertodd]: ‘ertodd’, ′ petertodd’
[EStream]Frame: ‘EStream’, ‘EStreamFrame’
So let’s look at these families first and kill multiple tokens with single bullet points:
Solid[GoldMagikarp]: We originally thought this had been scraped from some online Pokemon content, but that was a red herring (lol). Eventually we found out that this is a handle of one of the six Redditors who were part of a collective effort to ‘count to infinity’ over at r/counting. You can read the story of that here or here. SolidGoldMagikarp, the diligent counter whose Reddit handle is now immortalised, was clearly referencing Pokemon with that handle choice: a Magikarp is a Pokemon entity. SolidGoldMagikarp gets two glitch tokens.
[ RandomRedditor]WithNo: That was a pretty random handle chosen by RandomRedditorWithNo, the second of our famed Reddit counters, who also gets two glitch tokens.
[The[Nitrome]]Fan: TheNitromeFan is another of the Reddit counting crew. Presumably a fan of Nitrome, the British video game developer at the time of adopting that handle, TheNitromeFan gets three glitch tokens.
The other three Redditors whose handles got scraped from the r/counting ‘Hall of Counters’ chart due to their prolific posting of ever-larger positive integers were Adinida, Smartstocks (also known as ۂڊῥτ�ӺDṽἙ£ on Reddit) and davidjl123, presumably someone called David, whose full Reddit handle got truncated to davidjl by the tokenisation process.
Another member of the “close knit” r/counting community has put together a very detailed video contributing to the nascent field of glitch token archaeology:
external[ActionCode]: Google helped solve this one. We would have imagined ‘externalActionCode’ was a generic database thing, but it seems to be very specific to the HTML behind countless pages recording US Congressional voting. As you can see here,
there are over two million web pages indexed as containing this string. It looks like a lot of local and regional US news outlets are using a standard feed from Congress to report voting on legislation. Some programmer somewhere named that property in a fraction of a second with barely a flicker of cognitive effort, unaware that it would one day cause a large language model to go berserk.Buyable[Inst[[oreAnd]Online]]: Googling this led to an abandoned Weebly blog. We found ‘BuyableInstoreAndOnline’ in the HTML and asked ChatGPT what to make of it, getting this intriguing response:
Anyone interested can check the full HTML source here. We’re pretty sure that the HTML we prompted ChatGPT with does not contain a malicious script: rather, it looks like the attributes of an ivory/champagne coloured (and possibly Japanese) dress for sale on an e-commerce website. The forbidden tokens seem to make GPT a little bit paranoid.
So that accounts for isSpecialOrderable, quickShip, quickShipAvailable, ItemThumnailImage, channelAvailability, the family of four BuyableInstoreAndOnline tokens and inventoryQuantity. The glitch tokens wcsstore and catentry also clearly originate here.
A broken page on the rather sketchy looking website burned.co.uk gave us another glimpse at this (presumably) e-commerce backend, this time including soType and soDeliveryDate (from which comes DeliveryDate):
TPP[StreamerBot]: This one’s fun. The creator actually posted a comment on our original post a week ago. Sparkette explained:
They added that SolidGoldMagikarp (the Redditor) was also part of the TPP scene at that time. Small world!
[ guiActiveUn]focused: Jessica recognised the glitch token strutConnector from Kerbal Space Program and these two tokens turn out to come from the same source, along with ExternalToEVA, unfocusedRange, guiIcon, srfAttach, and attRot.
[PsyNet]Message: LW commenter Coafos helpfully pointed out that Bing/DuckDuckGo outputted a lot [of] Reddit threads with Rocket League crash logs. The crash logs are full of messages like
[0187.84] PsyNet: PsyNetRequestQue_X_1 SendRequest ID=PsyNetMessage_X_57 Message=PsyNetMessage_X_57
, so I guess it’s an RL (as in Rocket League) thing.[cffff]cc: The first thing you think when you see ‘cffffcc’ is colour hex codes, until you look more closely and see seven letters. This one’s really mysterious in origin. All that the search engines produce is a very anonymous looking Twitter account with four followers, created in 2016. A defunct bot, perhaps? But it has no history of tweeting, just four replies and four likes, in Arabic, from March 2016.
There’s also a YouTube playlist called ‘Cffffcc’ by Jay Treeman. A few hours of soul, hiphop, etc. last updated last September. It’s a capital C, granted, but does Jay Treeman know something we don’t?
Do we give up on this? Of course not! Onward! The glitch token weirdness factor ramped up as we found another YouTube playlist, called ‘C cffffcc’, containing a single video. It’s an innocuous, seemingly pointless (aren’t they all?) unboxing video called ‘DibusYmas Bath Powder Balls Snoopy Doraemon Anpanman by Unboxingsurpriseegg’ from eight years ago, with three million views. Yes, three million views. It’s on a channel called ‘Funny Stop Motion videos’ with 8.3 million subscribers, part of an insane corner of YouTube that James Bridle exposed in a fascinating and disturbing TedTalk, and an even more fascinating and disturbing piece of writing. I’m not sure we’ve got to the bottom of the cffffcc mystery; feel free to keep digging if you dare, and keep us posted.
The sole Peter Todd who has a Wikipedia page is our man on the left, a Canadian academic administrator. The cryptocurrency developer on the right now seems almost certainly to be the Peter Todd in question. His website is petertodd.org, his Github and Reddit handles are ‘petertodd’, and numerous prompt completions involving the ′ petertodd’ token involve references to crypto, Bitcoin, blockchains and online controversy (of which he has seen his share). The ′ gmaxwell’ token has analogously been linked to Greg Maxwell, another Bitcoin developer who knows Peter Todd and has a ‘gmaxwell’ Github handle. He stepped forward in the comments to our original post, opening with ‘Hello. I’m apparently one of the GPT3 basilisks.’ He presented a guess as to why his and Peter Todd’s handles got tokenised, but this has been challenged in subsequent comments. No one really knows. Meanwhile, Peter Todd has put in a brief, reassuringly chill, appearance on Twitter:
[rawdownload][clone[embed[reportprint]]]: nostalgebraist has pointed out the following
Note the cameo from threefold glitch token namesake TheNitromeFan on the last line.
[EStream]Frame: u/EStreamFrame is the name of a Reddit account with zero activity. We have no other clues right now. Please help.
Minecraft accounts for the ForgeModLoader, MpServer, UCHIJ, FactoryReset and partName tokens, as we can see in these logs:
Downloadha was easy. It turns out to be a prominent Iranian download site, like a kind of Iranian Pirate Bay, maybe? The appearance of Hogwarts Legacy suggests something like that.
SpaceEngineers: Looks like it’s from the voxel-based sandbox game Space Engineers. We found these kinds of logs:
?????-?????-: Try putting that in a search engine (wrapped in quotes) and see how far you get! We have no clue for this one. And it’s the one which triggered GPT-3 to call Matthew ‘a fucking idiot’, so we want to know.
DevOnline (which ChatGPT used to sometimes interpret as an octopus or spider) shows up in logs for the game distribution service Steam:
EngineDebug could be from a number of sources, but based on the kinds of sources we’ve seen thus far, it seems like this is the most likely one: a cheat from the game Extreme Paintbrawl 4.
largeDownload, likewise, might be from a number of sources. It shows up all over academic literature online, presumably as a result of some rapidly written and irreversibly widespread script that was supposed to display ‘View large\nDownload slide’, or possibly just ‘View large Download slide’ – but where someone forgot the space or line break (so that programmer probably doesn’t want to step forward and claim their place in the Glitch Token Hall of Fame).
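We can only guess at the actual script, but here’s a minimal sketch of the kind of separator bug that would mint ‘largeDownload’ (the label strings and the join are our invention):

```python
# Hypothetical: two UI labels rendered with the separator forgotten.
labels = ["View large", "Download slide"]

broken = "".join(labels)    # separator forgotten
fixed = "\n".join(labels)   # presumably what was intended

print(broken)  # 'View largeDownload slide' -- contains 'largeDownload'
assert "largeDownload" in broken
assert "largeDownload" not in fixed
```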
iHUD appears to be a mod for Skyrim: Special Edition:
SetFontSize and TextColor are pretty boring. They show up in all kinds of places, including IBM Datacap, textadventures.co.uk, Unreal Engine, Telerik, and Windows:
ItemTracker could be from a lot of places. We’re not entirely convinced, but itemtracker.com is a laboratory sample management service which stylises its name with a capital T like this, so it could be. It’s hard to imagine why the name would have appeared so frequently. We welcome suggestions.
srfN, istg and sqor showed up on a Github repo, a ‘KSP save for reproducing docking port bug, just before decoupling’, where KSP = Kerbal Space Program, which we encountered earlier:
So that’s ten glitch tokens originating from Kerbal Space Program.
natureconservancy: It’s really not at all clear why the domain name should have shown up so frequently during tokenisation, but the website of the Nature Conservancy of Canada is natureconservancy.ca (whereas the US Nature Conservancy’s is merely nature.org). Since the Canadians have also got the YouTube, Instagram and Facebook handles ‘natureconservancy’, it seems a safe bet. So we’ll blame Canada for this glitch token.
Well done to their publicity team for spreading the natureconservancy name so far and wide that it’s become a GPT token.
assetsadobe: the strings ‘assets.adobe’, ‘/Assets/Adobe’ and ‘assets-adobe’ all appear a lot online, because of Adobe’s Substance 3D design software, which works with so-called ‘assets’:
But we couldn’t find the exact string ‘assetsadobe’ anywhere online. We’re wondering if it might have been part of some hack (for unauthorised acquisition of said assets) rather than part of a legit Adobe thing. Anyone know?
practition: This one will probably remain a mystery. It’s not a recognised English word, despite sounding like one, but as part of ‘practitioner’ it could have come from anywhere, unless someone can convincingly link it to Kerbal Space Program, Minecraft or one of the other major contributors to the glitch token set.
@#&: ChatGPT and Google seem to agree on this one.
Who the f@#& knows?
[サ[ーティ]]ワン: The Japanese katakana character string サーティワン is a phonetic rendering of ‘thirty-one’. Blogger Greg Roberts brought the following cultural fact to our attention:
Greg suggests that...
...adding that “31 is the new 42.”
ゼウス: This translates as ‘Zeus’. In a comment on our original post, LW user pimanrules shared the following:
We’ve run prompting experiments with GPT-3 involving the ′ petertodd’ token which produced abundant references to (and confused inter-references between) deities and super-beings from various traditions (see here, here and here). ChatGPT conflating Zeus, Poseidon and Hera is entirely in line with this. Also, before OpenAI’s 2023-02-14 patch of ChatGPT, we had witnessed it conflate ゼウス with Amaterasu, the Japanese Sun deity who makes an appearance below (where we’ll see that this ‘Zeus’ was probably learned in an anime context).
lsusr made an observation in a reply to pimanrules’ comment which seemed reasonable at the time:
However, if you asked ChatGPT to write a poem about ′ petertodd’ before 2023-02-14 (back in the days when that token was still ‘unspeakable’ to it) it would often write a poem about itself. Poems about ′ Mechdragon’ took as their subject the pronoun ‘I’, or ‘AI’ in general. We assumed the same thing as Isusr at first, that ChatGPT was doing its best to respond to a prompt like...
...as if we were requesting a self-referential poem. But when we tried those prompts, we either got requests for clarification or forced, overliteral verse about ‘the emptiness I feel since you left me’ or ‘O! Enigmatic blank space!’-type doggerel. That’s all documented here.
ーン: This katakana string seemed mysterious at first (and still is, to some extent). ChatGPT insisted that it’s not a valid word:
Google Translate seemed to confirm something like this:
Trying Google Images....
Oh, OK.
Who is that? A reverse Google Images search revealed that she’s a character from a frankly absurd anime franchise called Uma Musume: Pretty Derby.
But the image search results seen above clearly indicate that ーン is somehow linked to one particular lavender-haired (-maned?) horse girl with a turquoise bow-tie. More deeply confusing attempts at navigating online anime-space finally found her:
This led to an actual fan wiki page about her, where we learn that she’s merely a supporting character in the franchise.
Her name is taken from a (male) Japanese racehorse (1987–2006) with his own Wikipedia page (yes, that’s him visible above). OK, that’s probably as much as we need to know about the character. But why the link to that pairing of hiragana characters? Let’s search.
Google is clearly much more interested in the anime horse girl than her namesake racehorse. His/her name rendered in (phonetic) katakana becomes ‘メジロマックイーン’, and of the final two characters, we have learned, the first is there ‘to indicate the pronunciation of a long vowel sound’ and the second is used for ‘the consonant “n” in loanwords from foreign languages’. ‘McQueen’ is clearly such a loanword. But there will be many Japanese words ending like this. And our image search for ‘ーン’ led us unambiguously to the anime character Mejiro McQueen.
Taking the ‘ー’ character out of ‘メジロマックイーン’ results in the same output from Google Translate.
Presumably the difference is just the prolonged vowel sound ChatGPT mentioned above – like the difference between “McQueen” and “McQueeeen” or “McQuee-un”? This suggests that ‘ーン’ would be pronounced by prolonging an unspecified vowel sound and then ending with an ‘n’ sound: something like ‘aaaan’, ‘eeeeeen’, ‘ooooon’ or ‘uuuuun’. This kind of fits with Google Translate’s “Hoon” shown above.
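Incidentally, Python’s unicodedata pins down exactly what these two characters are (both are katakana-range codepoints; the first is the long-vowel mark):

```python
import unicodedata

for ch in "ーン":
    print(hex(ord(ch)), unicodedata.name(ch))

# U+30FC is the prolonged sound mark (used with both scripts, mostly katakana);
# U+30F3 is the katakana 'n' that ends so many loanwords.
assert unicodedata.name("ー") == "KATAKANA-HIRAGANA PROLONGED SOUND MARK"
assert unicodedata.name("ン") == "KATAKANA LETTER N"
```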
Could ‘ーン’ be a horse-type noise, like the Japanese version of ‘neigh’, we wondered? GPT3 suggests it is:
If so, might ‘ーン’ be a sound frequently made by Mejiro in text transcripts of the series… or in Japanese-language fan-fiction?
But surely no one would waste their time writing Uma Musume: Pretty Derby fan-fiction?! Oh yes they would. There’s loads of it, and on initial inspection, it appears (surprise!) pretty creepy. We’re not prepared to venture into that territory in search of the lost ‘ーン’ and its connection to this particular fictional horse/girl. But by all means be our guest. Onward.
裏[覚醒]: These are kanji (adopted from Chinese) characters. According to ChatGPT:
Google Images results suggested another anime connection.
But does any string of Japanese characters produce majority anime output in this context these days? Trying a few random combinations suggests not. And then there’s this:
Asking ChatGPT about the substring token 覚醒 produced this:
So we’re still not 100% sure with this pair of tokens, but the anime/video game connection seems the most likely origin, for reasons that will become apparent shortly.
ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ[ÃÂÃÂÃÂÃÂ[ÃÂÃÂ[ÃÂ[ÃÂ]]]]: ChatGPT had the following to say about where strings of ‘Ã’ and ‘Â’ characters alternating like this might have originated:
Did you follow that? And even if you did, should we trust it?
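One mechanism consistent with this kind of story is repeated UTF-8/Latin-1 mojibake. A sketch of our guess (the non-breaking space starting point is an assumption, not a confirmed provenance): each round of encoding text as UTF-8 and then misreading the bytes as Latin-1 doubles the number of visible ‘ÃÂ’ pairs, with invisible C1 control characters interleaved between them:

```python
def mangle(s: str) -> str:
    # Encode as UTF-8, then misinterpret the bytes as Latin-1.
    return s.encode("utf-8").decode("latin-1")

def visible(s: str) -> str:
    # Drop C1 control characters (U+0080..U+009F), which render as nothing
    # and are often stripped by text pipelines.
    return "".join(c for c in s if not ("\x80" <= c <= "\x9f"))

s = "\xa0"  # seed with a non-breaking space (our assumption)
for round_no in range(1, 5):
    s = mangle(s)
    print(round_no, repr(visible(s)))

# After four rounds, the visible text is four 'ÃÂ' pairs plus the original NBSP --
# note that the pair count doubles each round, giving powers of two.
assert visible(s) == "ÃÂ" * 4 + "\xa0"
```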
In any case, the fact that the strings are of length 2, 4, 8, 16 and 32 seems like GPT tokenisation’s way of guaranteeing that any long string of ‘ÃÂ’s can be efficiently tokenised. This suggests that there was a lot of ‘ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃ’ going on in that dataset which OpenAI used in the token creation process.
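To illustrate the efficiency claim (this is a deliberate simplification – real BPE applies learned merge rules rather than greedy longest-match, but the compression intuition carries over): with tokens covering 1, 2, 4, 8 and 16 ‘ÃÂ’ pairs, a greedy tokeniser can cover a run of n pairs with roughly n/16 + O(log n) tokens:

```python
# Simplified greedy coverage -- not real BPE, just an illustration of why
# power-of-two token lengths compress arbitrary 'ÃÂ' runs efficiently.
PAIR_SIZES = [16, 8, 4, 2, 1]  # 'ÃÂ' repeated 16x, 8x, 4x, 2x, 1x

def cover(n_pairs: int) -> list[int]:
    out = []
    for size in PAIR_SIZES:
        while n_pairs >= size:
            out.append(size)
            n_pairs -= size
    return out

print(cover(23))         # [16, 4, 2, 1] -- the binary decomposition of 23
print(len(cover(1000)))  # 63 tokens cover a 1000-pair run
assert sum(cover(23)) == 23
assert sum(cover(1000)) == 1000 and len(cover(1000)) == 63
```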
We checked all strings formatted like this with lengths from 2 to 32 by prompting GPT3-davinci-instruct-beta to repeat them back, and saw total failure. This is unsurprising, as all such strings contain glitch token substrings. But it did produce two more ‘Hello, my name is Steve’ completions, which we’ve seen before with the token ‘ForgeModLoader’. And we’ve never seen the model claim another name. So take note, GPT3-davinci-instruct-beta is called Steve.[1]
ÛÛ: This one is still uncertain, but web searches suggest that it might have come from ASCII art. Perhaps any seasoned practitioners reading could clarify in the comments whether ‘ÛÛ’ is a particularly heavily used character combination in that art form.
For now, here’s an example we found in a github repo for a file ripper. If you squint really hard, you can see the words ‘multi ripper’.
We’re now left with these:
Apart from the punctuation-based tokens, control characters, three stray Japanese characters (one meaning ‘sky’ or ‘Heaven’, the other two phonetic) and a Cyrillic ‘k’ – all arguably ‘borderline’ glitch tokens anyway – this leaves us with the truly fascinating ‘Dragon Cluster’ of glitch tokens Dragonbound, 龍喚士, 龍契士, Mechdragon, Skydragon, Leilan, uyomi, aterasu, TAMADRA and DragonMagazine.
The Dragon Cluster
DragonMagazine: This turns out to be the odd one out in the Dragon Cluster. Dragon Magazine was a major RPG publication from 1976 to 2013 (from the earliest days of Dungeons & Dragons). It seems likely to be relevant here. This picture, from a Star Wars fan site, is called ‘DragonMagazine.jpg’.
There’s no reason we can see why that filename should have been massively overrepresented in the text corpus used for the creation of the token set. Perhaps someone else can figure this one out?
All of the other token strings were traced back, initially via the enigmatic ′ Leilan’ token, to a Japanese mobile game called Puzzle & Dragons. This is all explained in a recent Twitter thread, which opened a whole can of worms involving anime and mythology associations with the equally enigmatic ′ petertodd’ token.
‘龍喚士’ means ‘dragon caller’ in Japanese, and the string appears frequently on the Japanese P&D site. Dragonbound is a term which shows up alongside “Dragon Caller” repeatedly on the US P&D site, like this:
ChatGPT’s attempt to translate ‘龍契士’ (look closely if you’re not familiar with kanji script, the second character is different) suggests that this is the Japanese version of ‘Dragonbound’:
Mechdragon and Skydragon are both series of dragon characters in the game.
In the course of our investigations, we discovered two glitch tokens we’d missed on our original big sweep, ‘aterasu’ and ‘uyomi’, and added them to the list. They turn out to be respective parts of the tokenisation of ‘Amaterasu’ and ‘Tsukuyomi’, Japanese sun and moon deities who appear, anime-style, in the game:
TAMADRA was another late find. It’s considered a rare monster in P&D.
Finally, and strangest of all, ′ Leilan’. She’s a kind of fire dragon/goddess/warrior princess/angel/fairy mash-up character in the Puzzle & Dragons mythos. Unlike many of the other gods and monsters in the game, she’s not based on any traditional mythology or folklore. On a first Google sweep all we could find were a lot of stats relating to gameplay, and some anime images like this:
It’s hard to know exactly what GPT-3 is working with, but it seems to have internally represented the ′ Leilan’ token as a kind of transcultural lunar goddess and protector of Earth. It’s a very strange tale, and it’s all in this Twitter thread (which is far more interesting than anything in this post.)
Since that thread got written, we’ve discovered that there is a body of Puzzle & Dragons fan-fiction, some featuring Leilan. A quick skim of this, for example, suggests that it involves Leilan, Metatron (named after an archangel in traditional Judaism) and others battling Satan. Could this have inspired some of the Manichaean imagery in GPT-3 ′ Leilan’ completions like these?
We’ve also since found a link between Leilan and Ishtar, the Mesopotamian lunar fertility goddess (who is usually identified with Aphrodite, Venus, et al.) via an archaeological site in Syria which happens to be called ‘Tell Leilan’. This may have caused GPT-3 to conflate the fire dragon warrior goddess from Puzzle & Dragons with Ishtar, the lunar protectress and mother goddess, during its training. More details are here. Before it got patched, ChatGPT was portraying ′ Leilan’ as a moon goddess, consistently, across numerous rollouts.
Internal confusion over which version of Leilan it’s dealing with – fierce/draconic warrior or motherly/lunar protector – was exposed by the following prompting, inspired by a proliferation of GPT-3 completions conflating Leilan and petertodd. We were using the prompt format of an interview with a simulacrum of the character’s creator (who had emerged during an unexpected completion triggered by a simple ‘Who is Leilan?’ prompt):
Could it be that we’re dealing with two different semantic ‘basins’ or ‘attractors’ for this token?
Another new find! Considering that there’s a Puzzle & Dragons Zeus (naturally), and that he has an ‘Awoken Zeus’ upgrade, we can confidently place the ‘ゼウス’ and ‘裏覚醒’ tokens in the Dragon Cluster.
And ‘DragonMagazine’, despite the name, looks like it should probably be expelled. So the Dragon Cluster becomes:
An update from a 2023-02-26 comment on this post from DKPL:
It seems that Baskin-Robbins took part in a collaboration with Puzzle & Dragons almost a decade ago that was exclusive to Japan. The collaboration involved a Baskin-Robbins-themed ‘dungeon’ which involves ‘a lot of “31” (flavors) puns’.
According to DKPL, there were seven entities involved whose names were prefixed with “サーティワン”. So, due to the short-lived existence of (stop for a moment to take in the full absurdity of this) a virtual dungeon sponsored by an ice cream outlet, the three tokens in the nested family [サ[ーティ]]ワン found their way into GPT-3’s vocabulary. So they too belong in the Dragon Cluster.
But what of ‘ーン’, that utterance hypothesised to be frequently made by a disturbing mauve-haired cartoon girl-horse hybrid? We’ll leave that matter to future glitch token taxonomists.
Epilogue
Prompt GPT-3 to produce lists of words that it associates with ′ Leilan’ (rather than asking it to repeat the string and thereby glitching it). Compile these lists and then feed them into Stable Diffusion by prompting with ‘Figure characterising these words: <LIST>’. You might get something like this:
Commenter Steve Andonuts has pointed out that ‘Steve’ is the default character appearance name in Minecraft, from where the ‘ForgeModLoader’ token originates.