P(doom) = 50%. It either happens, or it doesn’t.
Lao Mein | Statistics is Hard. | Patreon
I give full permission for anyone to post part or all of any of my comments/posts to other platforms, with attribution.
Currently doing solo work on glitch tokens and tokenizer analysis. Feel free to send me job/collaboration offers.
DM me interesting papers you would like to see analyzed. I also specialize in bioinformatics.
Potential token analysis tool idea:
Use the tokenizers of common LLMs to tokenize a corpus of web text (OpenWebText, for example), then identify the contexts in which each token frequently appears, its correlation with other tokens, whether it is a glitch token, etc. It could act as a concise resource for explaining weird tokenizer-related behavior to those less familiar with LLMs (e.g. why they tend to be bad at arithmetic) and how a token entered a tokenizer's vocabulary.
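A minimal sketch of what such a backend could look like, assuming the GPT-2 tokenizer via tiktoken and a small in-memory document list standing in for OpenWebText. The function name `index_corpus` and its parameters are illustrative, not something from my existing code:

```python
# Sketch: per-token frequency and sample contexts over a corpus.
from collections import Counter, defaultdict

import tiktoken

enc = tiktoken.get_encoding("gpt2")

def index_corpus(docs, context_window=5, max_examples=3):
    """For each token id, record its total frequency across the corpus
    and a few decoded surrounding contexts where it appears."""
    freq = Counter()
    contexts = defaultdict(list)
    for doc in docs:
        ids = enc.encode(doc)
        freq.update(ids)
        for i, tok in enumerate(ids):
            if len(contexts[tok]) < max_examples:
                window = ids[max(0, i - context_window): i + context_window + 1]
                contexts[tok].append(enc.decode(window))
    return freq, contexts

if __name__ == "__main__":
    docs = ["SolidGoldMagikarp is a famous glitch token.",
            "Numbers like 1234567 tokenize in odd chunks."]
    freq, contexts = index_corpus(docs)
    for tok, n in freq.most_common(5):
        print(repr(enc.decode([tok])), n, contexts[tok][0])
```

Even this toy version surfaces the arithmetic point: `enc.encode("1234567")` splits the number into irregular multi-digit chunks rather than single digits, which is part of why these models struggle with arithmetic.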
Would this be useful, or would it duplicate existing work? I already did this with GPT-2 when analyzing glitch tokens, so I could probably code the backend in a few days.