I am not 100% sure I did that right but that seems like a more sensible answer.
Eyy, I should trust myself more. Verified on Pile-CC.
(GPT-2/3 BPE)
>>> k = 100000000; k / len(tokenizer(cc[:k])["input_ids"])
4.355680325470372
(T5 sentencepiece)
>>> k = 10000000; k / len(tokenizer(cc[:k])["input_ids"])
4.182535904979476
Eyy, I should trust myself more. Verified on Pile-CC.