Ahh. I was aware that there was an issue, but I definitely didn’t understand clearly enough what the issue was. I assumed that the larger number of tokens would allow better splitting, but the way you explain it, this wouldn’t help. (And given that, I wonder what good tokenizer development looks like—because I presume it’s hard to optimize alongside the model itself.)
One way to think of it: the least lossy and least biased tokenization, the one most faithful to all input representations and so allowing the most accurate modeling possible (and hence the best possible splitting), would have exactly 2 tokens: '0' and '1'.
Every tokenization beyond that implicitly pre-processes the data before the NN sees it, making a choice on the bias-variance tradeoff: it injects some bias (hiding the raw data) to reduce variance (by condensing texts into shorter token sequences and doing some 'thinking' in advance).
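To make the tradeoff concrete, here is a minimal sketch (mine, not taken from any real tokenizer) comparing sequence lengths for the same sentence at three granularities, with a crude whitespace split standing in for a learned subword vocabulary:

```python
# Same text, three granularities: bits, UTF-8 bytes, and whitespace "words".
text = "Tokenization trades fidelity for sequence length."

bits = [b for byte in text.encode("utf-8") for b in format(byte, "08b")]  # vocab size 2
byte_toks = list(text.encode("utf-8"))                                    # vocab size 256
word_toks = text.split()                                                  # large vocab, short sequence

print(len(bits), len(byte_toks), len(word_toks))
# 392 bits vs 49 bytes vs 6 "words": the bigger the vocabulary, the more
# 'thinking' is baked in before the NN sees anything, and the shorter the
# sequence it has to model.
```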
I wonder if there have been any experiments with feeding transformers just straight binary info. I’m guessing it hasn’t been done in this context due to potential context length limitations?
It’s both: context length, and the bias-variance tradeoff, which means modeling raw data is intrinsically harder. Realistically, byte-level is about as low-level as it is reasonable to tokenize at this point, and you can get good results that way, as ByT5 shows.
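Byte-level tokenization is simple enough to sketch in a few lines. This is only an illustration in the spirit of ByT5, not its actual implementation; the special-token offset here is an assumption for the sketch:

```python
# Byte-level tokenization sketch: every UTF-8 byte is one token, so the vocab is
# tiny (256 byte values plus a few special ids) and nothing in the raw text is
# hidden from the model, at the cost of much longer sequences than subword BPE.

NUM_SPECIAL = 3  # assumed: a few ids reserved for pad/eos/unk-style special tokens

def encode(text: str) -> list[int]:
    return [b + NUM_SPECIAL for b in text.encode("utf-8")]

def decode(ids: list[int]) -> str:
    return bytes(i - NUM_SPECIAL for i in ids if i >= NUM_SPECIAL).decode("utf-8", errors="ignore")

ids = encode("naïve café")          # non-ASCII characters expand to multiple byte tokens
assert decode(ids) == "naïve café"
print(len("naïve café"), len(ids))  # 10 characters -> 12 byte tokens
```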
You could definitely imagine that more complicated architectures with more flexible computation patterns than standard Transformers would be better able to handle bit-level encodings. A Perceiver, for example, selectively attends to bits and pieces of a very large binary input, saving computation by iteratively focusing only on the specific bits it needs; but such an arch is going to be that much harder to train, and will likely require more data to overcome the overhead & increased flexibility.
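To make that concrete, here is a toy sketch (mine, heavily simplified, not the actual Perceiver architecture) of a small set of learned latents cross-attending into a long bit-level input, so per-layer cost scales with num_latents × seq_len rather than seq_len² as in a standard Transformer over the same bits:

```python
import torch
import torch.nn as nn

class BitPerceiverSketch(nn.Module):
    def __init__(self, dim=64, num_latents=128, heads=4):
        super().__init__()
        self.bit_embed = nn.Embedding(2, dim)            # tokens are just 0 / 1
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, bits):                             # bits: (batch, seq_len) of {0, 1}
        x = self.bit_embed(bits)                         # (batch, seq_len, dim)
        # (A real model would also add positional encodings to x; omitted here.)
        q = self.latents.expand(bits.shape[0], -1, -1)   # (batch, num_latents, dim)
        # Latents query the raw bit sequence: the model "reads" only what it attends to.
        z, _ = self.cross_attn(q, x, x)
        # Further processing happens among the few latents, not over the full input.
        z2, _ = self.self_attn(z, z, z)
        return self.norm(z + z2)

bits = torch.randint(0, 2, (1, 8192))                    # an 8192-bit input sequence
print(BitPerceiverSketch()(bits).shape)                  # torch.Size([1, 128, 64])
```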