...They didn’t go over the tokens at the end to exclude uncommon ones?
Because we see this exact same behavior in the GPT4o tokenizer too. If I had to guess, the low-frequency ones make up 0.1-1% of total tokens.
This seems… obviously insane? You’re cooking AI worth $billions and you couldn’t do a single-line optimization? At the same time, it explains why the same username ended up as multiple tokens (“GoldMagikarp”, “ SolidGoldMagikarp”, etc.) even though it should only appear as a single string, at least with any frequency.
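For what it’s worth, this kind of vocab audit is only a few lines with tiktoken. A minimal sketch, assuming the published r50k_base encoding (the GPT-2/GPT-3 vocab where the Magikarp tokens live) matches what was used in production:

```python
# Scan a tiktoken vocabulary for near-duplicate "Magikarp" tokens.
# r50k_base is the public GPT-2/GPT-3 encoding; swap in "o200k_base" for GPT-4o.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")
for token_id in range(enc.n_vocab):
    try:
        text = enc.decode_single_token_bytes(token_id).decode("utf-8")
    except Exception:
        continue  # special tokens, unused IDs, or byte fragments that aren't valid UTF-8
    if "Magikarp" in text:
        print(token_id, repr(text))
```

Counting how often each token actually fires on a sample of the post-cleaning corpus, and dropping the ranks that never do, is exactly the kind of pass being complained about here.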
Remember, the new vocab was also full of spam tokens like Chinese porn phrases, which implies either (1) those are dead tokens never present in the training data and a waste of vocab space, or (2) the training data has serious problems if it really does still have a lot of porn spam in it. (There is also the oddly large number of chess games that GPT-4 was trained on.) This is also consistent with the original GPT-3 seeming to have been trained on very poorly reformatted HTML->text.
My conclusion has long been that OAers are in a hurry and give in to the usual ML researcher contempt for looking at & cleaning their data. Everyone knows you’re supposed to look at your data and clean it, but no one ever wants to eat their vegetables. So even though these are things that should take literally minutes to hours to fix, and would benefit OA for years to come as well as save potentially a lot of money...
This comment helped me a lot—I was very confused about why I couldn’t find Chinese spam in my tokens and then realized I had been using the old GPT4 tokenizer all along.
The old GPT4 tokenizer was actually very clean by comparison—every Chinese token was either common conversational Chinese or coding-related (Github, I assume—you see the same pattern with other languages).
I vaguely remember people making fun of a Chinese LLM for including CCP slogans in its tokenizer, but GPT4o also has token 193825 [中国特色社会主义] (“Socialism with Chinese characteristics”).
It’s actually crazy because something like 1⁄3 of the Chinese tokens are spam.
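If you want to check this yourself, here is a rough sketch (assuming tiktoken, and using my own crude “all-CJK, at least 5 characters” filter rather than any real spam classifier), similar in spirit to the o200k token lists linked below:

```python
# List the longest all-CJK tokens in the GPT-4o vocabulary.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # GPT-4o; compare with "cl100k_base" for GPT-4

def is_cjk(s: str) -> bool:
    return bool(s) and all("\u4e00" <= ch <= "\u9fff" for ch in s)

long_cjk = []
for token_id in range(enc.n_vocab):
    try:
        text = enc.decode_single_token_bytes(token_id).decode("utf-8").strip()
    except Exception:
        continue  # unused IDs, special tokens, or byte fragments that aren't valid UTF-8
    if is_cjk(text) and len(text) >= 5:
        long_cjk.append((token_id, text))

long_cjk.sort(key=lambda pair: len(pair[1]), reverse=True)
for token_id, text in long_cjk[:20]:
    print(token_id, text)
```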
The devil’s advocate position would be that glitch token behavior (ignore and shift attention down one token) is intended and helps scale data input. It allows the extraction of meaningful information from low-quality spam-filled webpages without the spam poisoning other embeddings.
Longest Chinese tokens in gpt4o · GitHub
chinese-tokens-in-tiktoken/chinese_tokens_o200k_base.tsv at main · secsilm/chinese-tokens-in-tiktoken · GitHub
My guess is that they are just lazy and careless about the tokenization/cleaning pipeline and never looked at the vocab to realize it’s optimized for the pre-cleaning training corpus, and they are not actually trying to squeeze blood out of the stone of Chinese spam. (If they actually had trained on that much Chinese spam, I would expect to have seen a lot more samples of that, especially from the papers about tricking GPT-4 into barfing out memorized training data.)
Note, if you are doubtful about whether OA researchers would really be that lazy and might let poor data choices slide by, consider that the WSJ reported 3 days ago that Scale, the multi-billion-dollar giant data labeler whose job for the past decade has been creating & cleaning data, last year blew a Facebook contract when the FB researchers actually looked at their data and noticed a lot of it started with “As an AI language model...”:
Facebook’s code name is Flamingo—a stuffed version of which sat atop an employee’s desk on a recent visit to the startup’s headquarters. After Scale AI bungled a project last year for the tech giant, Wang declared a company emergency and launched an all-hands-on-deck effort to fix the job, called Flamingo Revival, according to former Scale employees.
Early last year, Meta Platforms asked the startup to create 27,000 question-and-answer pairs to help train its AI chatbots on Instagram and Facebook. When Meta researchers received the data, they spotted something odd. Many answers sounded the same, or began with the phrase “as an AI language model…” It turns out the contractors had used ChatGPT to write up their responses—a complete violation of Scale’s raison d’être.
The researchers communicated the disappointing results to Scale, prompting Wang to rally the entire company to try and save the contract. He asked employees to drop everything and create new writing samples to send to Meta. An internal leaderboard showed who had completed the most labeling tasks. The prize for the winner: a paid vacation.
As usual, Hanlon’s razor can explain a lot about the world. (Amusingly, the HuggingFace “No Robots” instruction-tuning dataset advertises itself as “Look Ma, an instruction dataset that wasn’t generated by GPTs!”)
OK, I’m starting to see your point. Why do you think OpenAI is so successful despite this? Is their talent and engineering direction just that good? Is everyone else even worse at data management?
They (historically) had a large head start(up) on being scaling-pilled and various innovations like RLHF/instruction-tuning*, while avoiding pathologies of other organizations, and currently enjoy some incumbent advantages like what seems like far more compute access via MS than Anthropic gets through its more limited partnerships. There is, of course, no guarantee any of that will last, and it generally seems like (even allowing for the unknown capabilities of GPT-5 and benefits from o1 and everything else under the hood) the OA advantage over everyone else has been steadily eroding since May 2020.
* which, as much as I criticize the side-effects, have been crucial in democratizing LLM use for everybody who just wants to get something done instead of learning the alien mindset of prompt-programming a base model
That paper was released in November 2023, and GPT4o was released in May 2024. Old GPT4 had relatively normal Chinese tokens.
But the attacks probably still work, right? And presumably people have kept researching the topic, to understand the ‘o’ part of ‘GPT-4o’. (My hypothesis has been that the ‘secret’ tokens are the modality delimiters and the alternate modalities, and so figuring out how to trick GPT-4o into emitting or talking about them would yield interesting results, quite aside from barfing out spam.) I haven’t seen anything come up on Twitter or in tokenization discussions, so my inference is that it probably just wasn’t trained on that much spam and the spam was removed after the tokenization but before the training, due to sloppiness in the pipeline. Otherwise, how do you explain it all?
But research by whom? Chinese research is notoriously siloed. GPT4 access is non-trivially restricted. There have been zero peeps about digging into this on Chinese forums, where there is little discussion in general about the paper. I remember it being mocked on Twitter as being an extremely expensive way to pirate data. It’s just not that interesting for most people.
My experience with GPT2 is that out-of-context “glitch” tokens are mostly ignored.
prompts:
" Paris is theÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ capital of"
" Paris is the capital of"
" Paris is theÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ capital of the world's largest and most populous Arab country, and is one of the largest cities in the world with an area of 1.6 million people (more than half of them in Paris alone). It is home to"
" Paris is the capital of France, and its capital is Paris. The French capital has a population of about 6.5 billion (more than half of the world's population), which is a huge number for a city of this size. In Paris"
" Paris is theÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ capital of France, the largest state in France and one of the wealthiest in the world. The capital of Paris is home to over 1.2 billion people, and the country's economy is growing at a rapid clip. It"
' Paris is the capital of the European Union. Its population is about 3,500, and it has been under EU sanctions for more than a year. The EU\'s top diplomat has described the bloc as "a global power".\n\nFrance\'s'
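(For anyone who wants to poke at this themselves, here is a minimal sketch of the probe, assuming the Hugging Face transformers library and the stock gpt2 checkpoint; the sampling settings are my guess, not necessarily what produced the completions above.)

```python
# Compare GPT-2 continuations of a clean prompt vs. the same prompt with the
# mojibake glitch string spliced in, as in the examples above.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompts = [
    " Paris is the" + "ÃÂ" * 16 + " capital of",
    " Paris is the capital of",
]

for prompt in prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_new_tokens=45,
            do_sample=True,
            top_k=50,
            pad_token_id=tokenizer.eos_token_id,
        )
    print(repr(tokenizer.decode(output[0])))
```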
Even glitch tokens like ⓘ, which has an extremely strong association with geology archives, only have a partial effect if they’re present out of context.
" Paris is theⓘ capital of the French province of Lille. This region is the most important in the world, having the largest concentration of mines in Europe, with the highest levels of unemployment. The
' Paris is theⓘ capital of the province of France. The town has been in existence for more than 2,000 years.\n\nⓘ Montmartre Mine Céline-Roule, M., and Céline, J.'
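(A crude way to put a number on “partial effect”, and purely my own probe rather than something from the original experiments: compare GPT-2’s next-token distribution with and without the glitch token.)

```python
# Compare next-token distributions for a clean prompt vs. one with the ⓘ glitch
# token spliced in; a small KL divergence means the token is mostly being ignored.
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def next_token_dist(prompt: str) -> torch.Tensor:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    return F.softmax(logits, dim=-1)

clean = next_token_dist(" Paris is the capital of")
glitched = next_token_dist(" Paris is theⓘ capital of")

kl = F.kl_div(glitched.log(), clean, reduction="sum")  # KL(clean || glitched)
print("KL divergence:", kl.item())
print("top clean tokens:   ", [tokenizer.decode([int(i)]) for i in clean.topk(5).indices])
print("top glitched tokens:", [tokenizer.decode([int(i)]) for i in glitched.topk(5).indices])
```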
The “glitch” behavior is most prominent if you shine a “spotlight” of other tokens pointing directly at the location of the glitch token. This is what prompts like ‘What is the nature of “ertodd”?’ do. Normally, highly out-of-context tokens in conversational English are mostly stuff like usernames, dividing tokens, spam, encoding errors, SEO, etc. that simply don’t help predict the next token of conversational English, so the model is trained to assign them very little importance. So the generation of subsequent tokens is based on treating the glitch token as non-existent, interpreting random perturbations as information (or potentially treating it as censored data), or just injecting the “vibes” of the token into the following tokens.
Some glitch tokens, like “ertodd” (crypto spam), can break through, since they provide a lot of information about the subsequent text and fit perfectly well in conversational English.
' Paris is theertodd capital of the world, and the first major city to be built in the world.\n\nIt is located in Paris, the third largest city in the world, and the first major city to have a large number of high'
" Paris is theertodd capital of the world and the most popular place to invest in cryptocurrencies. We're here to help you.\n\nIf you are a new investor looking for the most secure and secure way to invest in cryptocurrencies, we offer a"
" Paris is theertodd capital of the world. It was founded by a group of computer scientists who developed the Bitcoin protocol in the early 1990s. It is the world's largest digital currency. Its main goal is to make it possible to store and"
Something similar happens with Japanese characters at GPT2’s level of capability: it isn’t capable enough to actually understand Japanese, and in its training data, Japanese in the middle of English text almost always has a directly adjacent English translation, so ignoring the Japanese is still the best option for minimizing loss.
Please inform me if I’m getting anything wrong—I’m working on a series of glitch posts.