I see your point. I think the existing tokenizer is designed to keep all parts of the text, while the idea here is to sacrifice some information in favor of compression. But while writing this, I also realized that this approach is more effective at saving characters than tokens.
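To illustrate what I mean, here's a minimal sketch (my own example, assuming a BPE tokenizer via OpenAI's tiktoken with the cl100k_base encoding; the strings are made up): common whole words are often already single tokens, so abbreviating them tends to shrink the character count much more than the token count.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

original = "Please send me the information about the meeting tomorrow."
compressed = "Pls send me the info about the mtg tmrw."  # lossy abbreviation

# Compare character count vs. BPE token count for both versions.
for label, text in [("original", original), ("compressed", compressed)]:
    tokens = enc.encode(text)
    print(f"{label:10s}  chars={len(text):3d}  tokens={len(tokens)}")

# Frequent full words are usually single tokens already, so dropping letters
# mostly saves characters; unusual abbreviations can even split into more
# tokens than the word they replace.
```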