New GPT-3 competitor
AI21 has trained a new language model, Jurassic-1, whose largest version has 178 billion parameters (GPT-3 had 175 billion). This paper gives limited technical details.
There were already several models with far more parameters than GPT-3, but they were either mixture-of-experts models or only word embeddings. They required much less compute to train and use, but were less powerful than a dense transformer like GPT-3 or the new Jurassic-1.
The interesting thing about Jurassic-1 is that it really doesn’t go much beyond GPT-3. It has a larger vocabulary and a slightly optimized architecture. Jurassic-1 has only slightly more parameters than GPT-3, whereas prior trends indicated that any GPT-3 successor would use at least an order of magnitude more parameters. Since GPT-3, much work has gone towards improving the transformer architecture (e.g., linear-time self-attention and neural architecture search), but little of that is visible in Jurassic-1. Maybe companies don’t think it’s economically viable to scale beyond GPT-3 or run many experiments with different architectures at that scale?
Also, Jurassic-1 is a unidirectional model, like GPT-3 (meaning it’s forced to process text from left to right). This means the model can only process a given word using the context provided by the previous words. This causes problems for unidirectional models on most tasks other than text generation. For example, other than GPT-3, all the top models in the SuperGLUE benchmark leaderboard are bidirectional models. It’s interesting that AI21 chose to compete with OpenAI using a model that provides the same class of service (text generation) as GPT-3, rather than specialize in, e.g., text classification, where a bidirectional model would be better.
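To make the left-to-right restriction concrete, here is a minimal sketch of which positions a token is allowed to see under a causal (unidirectional) mask versus no mask (bidirectional). The function names and the example sentence are made up for illustration; this only shows visibility, not how GPT-3, Jurassic-1, or any bidirectional model is actually trained.

```python
def attention_mask(n_tokens, causal):
    """Return an n x n matrix; mask[i][j] is True if position i may attend to position j."""
    return [[(j <= i) if causal else True for j in range(n_tokens)]
            for i in range(n_tokens)]


def visible_context(tokens, position, causal):
    """List the tokens that the token at `position` is allowed to look at."""
    mask = attention_mask(len(tokens), causal)
    return [tok for j, tok in enumerate(tokens) if mask[position][j]]


# Hypothetical example sentence: "bank" is ambiguous until "river" is seen.
tokens = ["The", "bank", "of", "the", "river"]

# Left-to-right model: "bank" only sees itself and what came before it.
print(visible_context(tokens, 1, causal=True))

# Bidirectional model: "bank" also sees "river", which disambiguates it.
print(visible_context(tokens, 1, causal=False))
```

In the causal case the representation of "bank" has to be computed without ever seeing "river", which is the kind of handicap that matters for classification-style tasks.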
No, the interesting thing is that it’s available as a public API. It took 13 months for an OA API competitor to emerge, but now it’s here and the OA API has a real competitor, and someone who will be happy to pick up many of the customers OA has driven away with its increasingly heavy-handed, arbitrary, and last-minute restrictions. (The tokenizer and better width vs depth scaling are trivial by comparison.)
The models came before, but not an API/SaaS. GPT-3 was already matched/exceeded by the dense models HyperClova & PanGu-α, and possibly MUM/LaMDA/Pathways/the Wu Daos*, but none of those are meaningfully publicly accessible, and so came and went. Jurassic-1 is available as an API, and is even free right now. That is very different, in much the same way that GPT-J is being so heavily used by everyone locked out of the OA API because it is available for free. “Free [public] is different.”
* details are sparse on all these, including the nature of any sparsity
It seems one can’t use Jurassic-1 without giving AI21 both your email address and your phone number. (For “validation”, but e.g. their “privacy policy” flat-out lies about what personal information they collect—it doesn’t include the phone number—so I don’t see any reason to treat it as meaningfully constraining what they might do with that information.)
The foregoing is not intended to express any judgement as to whether you should or shouldn’t care about this.
Well, an e-mail address and a phone number. Whether that’s identifying data is up to you (and to some extent, your jurisdiction and how easy it is to get an anonymous cash-paid SIM).
The key advance here seems to be the tokenizer, with its larger vocabulary; others have identified tokenization as a potentially critical limitation for GPT-3. I’d be very interested in seeing its performance on multi-digit addition tasks, for example.
I would be very interested in seeing how well this model works as a drop-in replacement for GPT-3 in various applications, both because it would undermine the market value of building AI systems which can be duplicated by others, and because it would say something about how flexible the architectures built around AI systems are to improvements.
My impression was that tokenization was a critical limitation for GPT-3 in the opposite direction, i.e. it caused GPT-3’s performance to suffer on tasks where character-level information is important (including multi-digit addition, and also things like rhyming, acronyms, etc), because the tokenization process clumps characters together by default and obscures that information. Having more (and longer) tokens does not seem like it would remedy that issue; if anything, it may exacerbate it.
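A toy sketch of the clumping problem, under loud assumptions: the vocabulary below is invented, and greedy longest-match is a simple stand-in for the BPE tokenizer GPT-3 actually uses (which picks merges differently). The point it illustrates is only that multi-character tokens split numbers inconsistently, so digit positions are not directly visible to the model.

```python
def greedy_tokenize(text, vocab):
    """Split `text` by repeatedly taking the longest vocabulary entry that matches."""
    tokens = []
    i = 0
    max_len = max(len(v) for v in vocab)
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return tokens


# Hypothetical vocabulary: single digits plus a few "frequent" multi-digit chunks.
TOY_VOCAB = set("0123456789") | {"12", "123", "20", "2021", "99"}

for number in ["1234", "2021", "2022", "1999"]:
    print(number, greedy_tokenize(number, TOY_VOCAB))
```

Here "2021" becomes a single opaque token while "2022" is split into three pieces, so two adjacent numbers look structurally unrelated. Adding more and longer entries to the vocabulary creates more such opaque chunks rather than fewer.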
Ahh. I was aware that there was an issue, but I definitely didn’t understand clearly enough what the issue was. I assumed that the larger number of tokens would allow better splitting, but the way you explain it, this wouldn’t help. (And given that, I wonder what good tokenizer development looks like—because I presume it’s hard to optimize alongside the model itself.)
One way to think of it: the least lossy and least biased tokenization, the one most faithful to all input representations, allowing the most accurate modeling possible, which allows the best possible splitting, would have exactly 2 tokens: ‘0’ and ‘1’.
All tokenizations beyond that are implicitly pre-processing the data before the NN sees it, and are making a choice on the bias-variance tradeoff to inject some bias (hiding the raw data) to reduce the variance (by condensing texts into shorter token sequences and doing some ‘thinking’ in advance).
I wonder if there have been any experiments with feeding transformers just straight binary info. I’m guessing it hasn’t been done in this context due to potential context length limitations?
It’s both context length and the bias-variance tradeoff: modeling raw data is intrinsically harder. Realistically, byte-level is about as low-level as is reasonable to tokenize at this point, and you can get good results that way, as with ByT5.
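For a rough sense of the context-length cost, here is a small sketch comparing a byte-level encoding with a bit-level one for the same string. The helper names and the sample sentence are just for illustration, and this is plain UTF-8 handling, not ByT5's actual tokenizer.

```python
def to_bytes(text):
    """Byte-level 'tokens': one integer in 0..255 per UTF-8 byte."""
    return list(text.encode("utf-8"))


def to_bits(text):
    """Bit-level 'tokens': eight 0/1 symbols per UTF-8 byte."""
    return [int(b) for byte in text.encode("utf-8") for b in format(byte, "08b")]


def from_bits(bits):
    """Invert to_bits, showing the 2-token encoding loses nothing."""
    data = bytes(
        int("".join(str(b) for b in bits[i:i + 8]), 2)
        for i in range(0, len(bits), 8)
    )
    return data.decode("utf-8")


text = "Jurassic-1 has 178 billion parameters."
byte_tokens = to_bytes(text)
bit_tokens = to_bits(text)

print("byte-level sequence length:", len(byte_tokens))
print("bit-level sequence length: ", len(bit_tokens))
print("round trip ok:", from_bits(bit_tokens) == text)
```

The bit-level sequence is always 8 times longer than the byte-level one, and since standard self-attention scales quadratically with sequence length, that is on the order of 64 times the attention cost for the same text, before even counting the harder modeling problem.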
You could definitely imagine that more complicated architectures with more flexible computation patterns than standard Transformers would be more able to handle bit-level encodings, like a Perceiver which selectively attends to bits and pieces of a very large binary input, saving computation by only iteratively focusing on the specific bits which it needs, but such an arch is going to be that much harder to train, and likely require more data to overcome the overhead & increased flexibility.
I agree.