I have a pretty good lead on where “cffff” came from.
Aside from random hexadecimal in code and databases (a surprising proportion of which were password hashes on breach forums), it’s part of several World of Warcraft chat commands.
For example, “124cffffd000” is apparently used as part of a command to change chat text color?
It also looks like a common part of WoW auction logs, which seem like exactly the type of thing that would get included when building the tokenizer but excluded from training.
No leads on “cfffcc” though: there were 0 instances of it in OpenWebText. Not sure what this means.
In case anyone’s curious, \124 in a Lua string literal is a decimal escape for “|” (VERTICAL BAR), which looks to be used as a control sequence introducer here. I assume the subsequent character “c” represents a “color” command followed by a 32-bit color written as eight hex digits: FFD000 is 24-bit RGB for an orange color that looks like the text depicted on that page, with the preceding FF probably meaning full opacity.
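To make that reading concrete, here is a rough Python sketch that decodes the sequence under the assumption above: “|c” followed by eight hex digits read as alpha, red, green, blue. The function name parse_color_escape and the AARRGGBB layout are my own guesses for illustration, not anything taken from WoW's documentation.

```python
import re

# Lua's "\124" decimal escape denotes the character with code 124, i.e. "|".
PIPE = chr(124)

# Assumed layout (a guess, not confirmed): "|c" followed by 8 hex digits, AARRGGBB.
COLOR_ESCAPE = re.compile(r"\|c([0-9a-fA-F]{8})")


def parse_color_escape(text):
    """Return (alpha, red, green, blue) for the first color escape in text, or None."""
    match = COLOR_ESCAPE.search(text)
    if match is None:
        return None
    value = int(match.group(1), 16)
    return (
        (value >> 24) & 0xFF,  # alpha: the leading FF, presumably full opacity
        (value >> 16) & 0xFF,  # red
        (value >> 8) & 0xFF,   # green
        value & 0xFF,          # blue
    )


print(parse_color_escape(PIPE + "cffffd000Hello"))  # (255, 255, 208, 0)
```

If that layout is right, the “cffff” fragment would just be the “c” command plus an opaque alpha byte and a fully saturated red channel, which would turn up for any yellow, orange, or white text.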