The key advance here seems to be the tokenizer, with its larger vocabulary; tokenization has been identified by others as a potentially critical limitation for GPT-3. I'd be very interested in seeing this model's performance on multi-digit addition tasks, for example.
I would be very interested in seeing how well this model works as a drop-in replacement for GPT-3 in various applications, both because a successful replacement would undermine the market value of building AI systems that others can duplicate, and because it would say something about how readily the architectures built around AI systems can accommodate improvements.
The key advance here seems to be the tokenizer, with its larger vocabulary; tokenization has been identified by others as a potentially critical limitation for GPT-3.
My impression was that tokenization was a critical limitation for GPT-3 in the opposite direction, i.e. it caused GPT-3's performance to suffer on tasks where character-level information is important (including multi-digit addition, and also things like rhyming, acronyms, etc.), because the tokenization process clumps characters together by default and obscures that information. Having more (and longer) tokens does not seem like it would remedy that issue; if anything, it may exacerbate it.
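To make the clumping concrete, here is a quick sketch (assuming the `tiktoken` package; its "gpt2" encoding is, to my knowledge, the same BPE vocabulary GPT-2/GPT-3 use) that prints how a few strings get split:

```python
# Minimal sketch: inspect how the GPT-2/GPT-3 BPE splits strings.
# The exact splits depend on the learned merges, but digits typically
# come out clumped into multi-character tokens rather than one per digit.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

for text in ["12345 + 67890", "rhyme scheme", "NASA"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {pieces}")
    # The model only ever sees the token IDs, so any character-level
    # structure hidden inside a multi-character token is obscured.
```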
Ahh. I was aware that there was an issue, but I definitely didn’t understand clearly enough what the issue was. I assumed that the larger number of tokens would allow better splitting, but the way you explain it, this wouldn’t help. (And given that, I wonder what good tokenizer development looks like—because I presume it’s hard to optimize alongside the model itself.)
One way to think of it: the least lossy and least biased tokenization, the one most faithful to all input representations and thus allowing the most accurate modeling (and the best possible splitting), would have exactly two tokens: '0' and '1'.
All tokenizations beyond that implicitly pre-process the data before the NN sees it, making a choice on the bias-variance tradeoff: they inject some bias (hiding the raw data) in order to reduce variance (condensing texts into shorter token sequences and doing some 'thinking' in advance).
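A toy way to see the variance side of that tradeoff is just to count sequence lengths at different granularities (the word-splitting below is only a crude stand-in for a learned BPE vocabulary):

```python
# Sketch: the coarser the tokenization, the shorter the sequence the model
# must process, at the cost of hiding more of the raw input.
text = "The key advance here seems to be the tokenizer."

bit_tokens  = "".join(f"{b:08b}" for b in text.encode("utf-8"))
byte_tokens = list(text.encode("utf-8"))
word_tokens = text.split()  # crude stand-in for a BPE/word-level vocabulary

print(len(bit_tokens), "bit tokens")        # least biased, longest sequence
print(len(byte_tokens), "byte tokens")      # 8x shorter, still character-faithful
print(len(word_tokens), "word-ish tokens")  # shortest, most 'pre-chewed'
```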
I wonder if there have been any experiments with feeding transformers just straight binary info. I’m guessing it hasn’t been done in this context due to potential context length limitations?
It's both: context length, and the bias-variance tradeoff, which means modeling raw data is intrinsically harder. Realistically, byte-level is about as low-level as it is reasonable to tokenize at this point, and you can get good results that way, as ByT5 shows.
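For reference, byte-level 'tokenization' is barely a tokenizer at all; a minimal sketch of the idea (the special-token offset of 3 is my assumption about roughly how ByT5 handles it, not something to rely on):

```python
# Sketch of byte-level tokenization: one ID per UTF-8 byte, fully reversible,
# so character-level information is never hidden from the model.
SPECIAL_OFFSET = 3  # assumed: a few low IDs reserved for pad/eos/unk

def byte_tokenize(text: str) -> list[int]:
    return [b + SPECIAL_OFFSET for b in text.encode("utf-8")]

def byte_detokenize(ids: list[int]) -> str:
    return bytes(i - SPECIAL_OFFSET for i in ids).decode("utf-8")

ids = byte_tokenize("rhyme")
print(ids)                   # one ID per byte
print(byte_detokenize(ids))  # 'rhyme'
```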
You could definitely imagine that more complicated architectures with more flexible computation patterns than standard Transformers would be better able to handle bit-level encodings. A Perceiver, for example, selectively attends to bits and pieces of a very large binary input, saving computation by iteratively focusing only on the specific bits it needs. But such an architecture is going to be that much harder to train, and will likely require more data to overcome the overhead and increased flexibility.
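As a rough illustration of the attention pattern I have in mind (a toy PyTorch sketch, not the actual Perceiver architecture, and all the sizes are made up):

```python
# Toy sketch: a small set of learned latents cross-attends to a very long
# bit-level input, so the expensive attention scales with
# num_latents x input_length rather than input_length^2.
import torch
import torch.nn as nn

d_model, num_latents, num_heads = 64, 32, 4

embed = nn.Embedding(2, d_model)  # the whole vocabulary is '0' and '1'
latents = nn.Parameter(torch.randn(1, num_latents, d_model))
cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

bits = torch.randint(0, 2, (1, 8192))  # a long binary input
x = embed(bits)                        # (1, 8192, d_model)

# Latents query the raw bits once (a 32 x 8192 attention map)...
z, _ = cross_attn(latents, x, x)
# ...then further processing happens only among the 32 latents.
z, _ = self_attn(z, z, z)
print(z.shape)  # torch.Size([1, 32, 64])
```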
I agree.