Is there a reason why every LLM tokenizer I’ve seen excludes slurs? It seems like a cheap way to nudge the model toward AI-assistant behavior.
Also notable that numbers are tokenized digit by digit. I assume this greatly improves its performance on basic arithmetic compared to GPTs, whose BPE merges digit runs into arbitrary multi-digit chunks, so place value is never explicit to the model.
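
A minimal sketch of the difference, assuming tiktoken is installed (pip install tiktoken). The digit-splitting regex below is only an illustration of the per-digit pre-tokenization rule (as LLaMA-style tokenizers use), not any particular model's actual implementation:

    import re
    import tiktoken

    # GPT-2's BPE merges digit runs into multi-digit tokens.
    enc = tiktoken.get_encoding("gpt2")
    text = "12345 + 678"

    bpe_tokens = [enc.decode([t]) for t in enc.encode(text)]
    print("GPT-2 BPE:  ", bpe_tokens)  # digit runs come out as chunks like '123', '45'

    # Per-digit pre-tokenization: every digit becomes its own token,
    # so the model always sees place value explicitly.
    per_digit = re.findall(r"\d|\D+", text)
    print("Digit split:", per_digit)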
That’s interesting, if true. Maybe the tokeniser was trained on a dataset that had been filtered for dirty words.