Paging Gwern or anyone else who can shed light on the current state of the AI market—I have several questions.
Since the release of ChatGPT, at least 17 companies, according to the LMSYS Chatbot Arena Leaderboard, have developed AI models that outperform it. These companies include Anthropic, NexusFlow, Microsoft, Mistral, Alibaba, Hugging Face, Google, Reka AI, Cohere, Meta, 01 AI, AI21 Labs, Zhipu AI, Nvidia, DeepSeek, and xAI.
Since GPT-4’s launch, 15 different companies have reportedly created AI models that are smarter than GPT-4. Among them are Reka AI, Meta, AI21 Labs, DeepSeek AI, Anthropic, Alibaba, Zhipu, Google, Cohere, Nvidia, 01 AI, NexusFlow, Mistral, and xAI.
Twitter AI (xAI), which seemingly had no prior history of strong AI engineering, with a small team and limited resources, has somehow built the third smartest AI in the world, apparently on par with the very best from OpenAI.
The top AI image generator, Flux AI, which is considered superior to the offerings from OpenAI and Google, has no Wikipedia page, barely any information available online, and seemingly almost no employees. The next best in class, Midjourney and Stable Diffusion, also operate with surprisingly small teams and limited resources.
I have to admit, I find this all quite confusing.
I expected companies with significant experience and investment in AI to be miles ahead of the competition. I also assumed that any new competitors would be well-funded and dedicated to catching up with the established leaders.
Understanding these dynamics seems important because they influence the merits of things like a potential pause in AI development or the ability of China to outcompete the USA in AI. Moreover, as someone with general market interests, the valuations of some of these companies seem potentially quite off.
So here are my questions:
1. Are the historically leading AI organizations—OpenAI, Anthropic, and Google—holding back their best models, making it appear as though there’s more parity in the market than there actually is?
2. Is this apparent parity due to a mass exodus of employees from OpenAI, Anthropic, and Google to other companies, resulting in the diffusion of “secret sauce” ideas across the industry?
3. Does this parity exist because other companies are simply piggybacking on Meta’s open-source AI model, which was made possible by Meta’s massive compute resources? Now, by fine-tuning this model, can other companies quickly create models comparable to the best?
4. Is it plausible that once LLMs were validated and the core idea spread, it became surprisingly simple to build, allowing any company to quickly reach the frontier?
5. Are AI image generators just really simple to develop but lack substantial economic reward, leading large companies to invest minimal resources into them?
6. Could it be that legal challenges in building AI are so significant that big companies are hesitant to fully invest, making it appear as if smaller companies are outperforming them?
7. And finally, why is OpenAI so valuable if it’s apparently so easy for other companies to build comparable tech? Conversely, why are these no-name companies that make leading LLMs not valued higher?
Of course, the answer is likely a mix of the factors mentioned above, but it would be very helpful if someone could clearly explain the structures affecting the dynamics highlighted here.
I’ll try to address just these two points with what I know, which is limited.
OpenAI, Anthropic, Google / Google DeepMind, Meta, AWS, and NVIDIA are the main current holders of AI compute hardware. Of these, the majority of frontier engineering talent (and expressed intent to race for AGI) seems to be concentrated in OpenAI, Anthropic, and Google / Google DeepMind. I think of these as a group I call ‘the Top 3’.
So far as the public knows, the Top 3 are working hard on preparing the next generation of LLMs (now all multimodal, so LLM is a bit of a misnomer). Preparing a next generation takes time and effort from the researchers, plus engineering support and training time on the large clusters. We are roughly halfway through the expected interval between full-step versions (e.g. GPT-4 to GPT-5). Of these companies, my best guess is that OpenAI has a slight lead, and thus will probably deploy their next gen (i.e. GPT-5) before the other companies deploy theirs (Claude 4 Opus, Google Gemini 2 Ultra). The time lag may be a couple of weeks or as much as six months. Hard to say for sure. Google has a lot of resources, and so might slightly beat OpenAI.
In any case, the big differentiator in terms of the Top 3 versus the Rest is the massive amount of hardware. GPT-4 level was something that could be trained on a wide variety of different rentable resources. The scaling trend suggests that the next level up, GPT-5, will require more resources than the Rest are expected to be able to muster this year. Since hardware is advancing, some members of the Rest may acquire GPT-5 level resources late next year (2025) or early 2026. This means they’ll be quite a few months behind. As for the implied resources needed for GPT-6 level if scaling trends and costs continue on trend, it seems unlikely that most of the Rest will be able to afford to scale that large (even acting late, and with rented resources) with the exception of Meta and possibly xAI.
Current opinion, which I agree with, puts Claude 3.5 Sonnet at about the 4.25 to 4.3 level compared to GPT-4. All of the others (including GPT-4o) fall somewhere between level 4 and level 4.3.
Nobody has a level 5 yet because even the leaders haven’t had the time to create and deploy it yet!
As for “secret sauce” ideas, we haven’t yet gotten public knowledge about any secret knowledge being a hard blocker. It does seem like there is a fair amount of small technical secrets which improve compute efficiency or capabilities in minor ways, but that these can all be compensated for by spending more on compute or coming up with alternate approaches. There is a huge amount of ML research being published every month now because the field has gotten so lucrative and trendy. The newest public research isn’t yet present in the deployed models because the models were trained and deployed before the new research was published!
This makes for a strange dynamic: companies that lag slightly behind in starting their large training runs get the advantage of a later cut-off date for incorporating public research. I think that getting to incorporate more of the latest research is a significant part of the explanation for why the GPT-4-sized models of late-movers have slightly surpassed GPT-4.
This doesn’t imply that OpenAI is losing the race, or that they don’t have valuable technical secrets. The Top 3 are still in the lead, so far as I can foresee; they are just in the ‘hidden progress’ phase which comes between model generations. Because of this, we can’t know their relative standing for certain. Presumably, even they don’t know, since they don’t have the details on the secret tech that their competitors are putting into their next generation. We will need to wait and see how the next generation of the Top 3’s models compare to each other.
Unclear if going beyond GPT-5 will be crucial, at that point researchers might get more relevant than compute again. GPT-4 level models (especially the newer ones) have the capability to understand complicated non-specialized text (now I can be certain some of my more obscure comments are Objectively Understandable), so GPT-5 level models will understand very robustly. If this is sufficient signal to get RL-like things off the ground (automating most labeling with superhuman quality, usefully scaling post-training to the level of pre-training), more scale won’t necessarily help on the currently-somewhat-routine pre-training side.
I think a little more explanation is required on why there isn’t already a model with 5-10x* more compute than GPT-4 (which would be “4.5 level” given that GPT version numbers have historically gone up by 1 for every two OOMs, though I think the model literally called GPT-5 will only be a roughly 10x scale-up).
You’d need around 100,000 H100s (or maybe somewhat fewer; Llama 3.1 was 2x GPT-4 and trained using 16,000 H100s) to train a model at 10x GPT-4. This has been available to the biggest hyperscalers since sometime last year. Naively it might take ~9 months from taking delivery of chips to releasing a model (perhaps 3 months to set up the cluster, 3 months for pre-training, 3 months of post-training, evaluations, etc). But most likely the engineering challenges in building a cluster that big, which is unprecedented, and perhaps high demand for inference, have prevented them from concentrating that much compute into one training run in time to release a model by now.
*I’m not totally sure the 5x threshold (1e26 FLOP) hasn’t been breached but most people think it hasn’t.
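The versioning rule of thumb in this comment (one full GPT version per two orders of magnitude of compute) can be written as a one-line formula. The sketch below is only an illustration of that rule; the function name `gpt_level` and the choice of GPT-4 as the baseline are assumptions made for the example, not anything the labs publish.

```python
import math


def gpt_level(compute_ratio_vs_gpt4: float, base_level: float = 4.0) -> float:
    """Map a compute multiple of GPT-4 to a notional "GPT level".

    Assumes +1 level per two orders of magnitude (100x) of training compute,
    which is the rule of thumb stated in the comment, not an official scale.
    """
    return base_level + math.log10(compute_ratio_vs_gpt4) / 2.0


for ratio in (1, 5, 10, 100):
    print(f"{ratio:>4}x GPT-4 compute -> level {gpt_level(ratio):.2f}")
```

Under this rule a 10x scale-up lands at level 4.5, and a full level 5 would need 100x.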
GPT-4 (Mar 2023 version) is rumored to have been trained on 25K A100s for 2e25 FLOPs, and Gemini 1.0 Ultra on TPUv4s (this detail is in the report) for 1e26 FLOPs. In BF16, A100s give 300 teraFLOP/s, TPUv4s 270 teraFLOP/s, H100s 1000 teraFLOP/s (marketing materials say 2000 teraFLOP/s, but that’s for sparse computation that isn’t relevant for training). So H100s have 3x advantage over hardware that trained GPT-4 and Gemini 1.0 Ultra. Llama-3-405b was trained on 16K H100s for about 2 months, getting 4e25 BF16 FLOPs at 40% compute utilization.
With 100K H100s, 1 month at 30% utilization gets you 8e25 FLOPs. OpenAI might have obtained this kind of training compute in May 2024, and xAI might get it at the end of 2024. AWS announced access to clusters with 20K H100s back in July 2023, which is 2e25 FLOPs a month at 40% utilization.
So assuming AWS’s offer is real for the purpose of training a single model on the whole 20K H100s cluster and was sufficiently liquid, for a year now 6 months of training could have yielded a 1.2e26 FLOPs model, which is 6x GPT-4, 3x Llama-3-405b, or on par with Gemini 1.0 Ultra. But much more than that wasn’t yet possible, not without running multiple such clusters in parallel, using geographically distributed training with low communication between clusters, something like DiLoCo. Now that 100K H100s clusters are getting online, 6 months of training will be giving about 5e26 FLOPs (assuming 30% utilization and that FP8 still couldn’t be made to work for training models at this scale).
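The cluster arithmetic in the last few paragraphs follows one formula: chips × peak FLOP/s × utilization × time. Here is a minimal sketch of it, using the peak throughput figures quoted above; the 30-day month and the helper name `training_flops` are assumptions for the example.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month

# Dense BF16 peak throughput per chip in FLOP/s, as quoted in the comment.
PEAK_FLOPS = {"A100": 300e12, "TPUv4": 270e12, "H100": 1000e12}


def training_flops(chip: str, n_chips: int, months: float, utilization: float) -> float:
    """Total training compute: chips * peak * utilization * wall-clock time."""
    return PEAK_FLOPS[chip] * n_chips * utilization * months * SECONDS_PER_MONTH


scenarios = [
    ("100K H100s, 1 month, 30% utilization", "H100", 100_000, 1, 0.30),
    ("20K H100s, 1 month, 40% utilization", "H100", 20_000, 1, 0.40),
    ("20K H100s, 6 months, 40% utilization", "H100", 20_000, 6, 0.40),
    ("100K H100s, 6 months, 30% utilization", "H100", 100_000, 6, 0.30),
]

for label, chip, n_chips, months, utilization in scenarios:
    flops = training_flops(chip, n_chips, months, utilization)
    print(f"{label}: {flops:.1e} FLOPs")
```

The results should land within rounding of the 8e25, 2e25, 1.2e26, and 5e26 figures given above.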
Do you have a citation for the claim that Gemini 1.0 Ultra trained for 1e26 FLOPs? I had searched all around but can’t find any information on its compute cost.
I originally saw the estimate from EpochAI, which I think was either 8e25 FLOPs or 1e26 FLOPs, but I’m either misremembering or they changed the estimate, since currently they list 5e25 FLOPs (background info for a metaculus question claims the Epoch estimate was 9e25 FLOPs in Feb 2024). In Jun 2024, SemiAnalysis posted a plot with a dot for Gemini Ultra (very beginning of this post) where it’s placed at 7e25 FLOPs (they also slightly overestimate Llama-3-405B at 5e25 FLOPs, which wasn’t yet released then).
The current notes for the EpochAI estimate are linked from the model database csv file:
Among other clues, the Colab notebook cites Gemini 1.0 report on use of TPUv4 in pods of 4096 across multiple datacenters for Gemini Ultra, claims that SemiAnalysis claims that Gemini Ultra could have been trained on 7+7 pods (which is 57K TPUs), and cites an article from The Information (paywalled):
One TPUv4 offers 275e12 FLOP/s, so at 40% MFU this gives 1.6e25 FLOPs a month by SemiAnalysis estimate on number of pods and 2.2e25 FLOPs a month by The Information’s claim on number of TPUs.
They arrive at 6e25 FLOPs as the point estimate from hardware considerations. The training duration range is listed as 3-6 months before the code, but it’s actually 1-6 months in the code, so one of these is a bug. If we put 3-6 months in the code, their point estimate becomes 1e26 FLOPs. They also assume MFU of 40-60%, which seems too high to me.
If their claim of 7+7 pods from SemiAnalysis is combined with the 7e25 FLOPs estimate from the SemiAnalysis plot, this suggests training time of 4 months. At that duration, but with TPU count claim from The Information, we get 9e25 FLOPs (4 months at 2.2e25 FLOPs a month). So after considering Epoch’s clues, I’m settling at 8e25 FLOPs as my own point estimate.
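For anyone who wants to retrace the Gemini Ultra estimate, here is a small sketch of the steps, using only numbers quoted in this comment (275e12 FLOP/s per TPUv4, 40% MFU, 7+7 pods of 4096 chips, and the 2.2e25 FLOPs a month figure attributed to The Information). The 30-day month and all variable names are assumptions of the sketch.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month
TPUV4_PEAK = 275e12                 # FLOP/s per chip, as quoted above
MFU = 0.40                          # utilization assumed in the comment


def monthly_flops(n_chips: int, peak: float = TPUV4_PEAK, mfu: float = MFU) -> float:
    """Compute delivered by a cluster in one month at the given utilization."""
    return n_chips * peak * mfu * SECONDS_PER_MONTH


# SemiAnalysis clue: 7+7 pods of 4096 TPUv4 chips each.
semianalysis_tpus = (7 + 7) * 4096
per_month_sa = monthly_flops(semianalysis_tpus)

# Training duration implied by the 7e25 FLOPs dot on the SemiAnalysis plot.
implied_months = 7e25 / per_month_sa

# Same duration, but with the monthly rate quoted for The Information's TPU count.
per_month_info = 2.2e25
total_info = per_month_info * round(implied_months)

print(f"TPUs in 7+7 pods: {semianalysis_tpus}")
print(f"FLOPs per month (SemiAnalysis pods): {per_month_sa:.1e}")
print(f"Implied training time: {implied_months:.1f} months")
print(f"Total at The Information's rate: {total_info:.1e} FLOPs")
```

The last figure comes out just under 9e25 FLOPs, between the 7e25 plot value and the 1e26 upper estimate.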
Yes, good point Josh. If the biggest labs had been pushing as fast as possible, they could have a next model by now. I don’t have a definite answer to this, but I have some guesses. It could be a combination of any of these.
Keeping up with inference demand, as Josh mentioned
Wanting to focus on things other than getting the next big model out ASAP: multimodality (e.g. GPT-4o), better versions of cheaper smaller models (e.g. Sonnet 3.5, Gemini Flash), non-capabilities work like safety or watermarking
choosing to put more time and effort into improving the data/code/training process which will be used for the next large model run. Potentially including: smaller scale experiments to test ideas, cleaning data, improving synthetic data generation (strawberry?), gathering new data to cover specific weak spots (perhaps by paying people to create it), developing and testing better engineering infrastructure to support larger runs
wanting to spend extra time evaluating performance of the checkpoints partway through training to make sure everything is working as expected. Larger scale means mistakes are much more costly. Mistakes caught early in the training process are less costly overall.
wanting to spend more time and effort evaluating the final product. There were several months where GPT-4 existed internally and got tested in a bunch of different ways. Nathan Labenz tells interesting stories of his time as a pre-release tester. Hopefully, with the new larger generation of models the companies will spend even more time and effort evaluating the new capabilities. If they scaled up their evaluation time from 6-8 months to 12-18 months, then we’d expect that much additional delay. We would only see a new next-gen model publicly right now if they had started on it ASAP and then completely skipped the safety testing. I really hope no companies choose to skip safety testing!
if safety and quality testing is done (as I expect it will be), then flaws found could require additional time and effort to correct. I would expect multiple rounds of test—fine-tune—test—fine-tune before the final product is deemed suitable for release.
even after the product is deemed ready, there may be reasons for further delaying the release. These might include: deciding to focus on using the new model to distill the next generation of smaller cheaper models and wanting to be able to release all of them together as a set, waiting for a particularly dramatic or appropriate time to release in order to maximize expected public impact, wanting to scale/test/robustify their inference pipeline to make sure they’ll be able to handle the anticipated release-day surge, wanting to check if the model seems so good at recursive self-improvement that they need to dumb the public version down in order not to hand their competitors an advantage from using for ML research (could include making sure the model can’t replicate secret internal techniques, or even potentially poisoning the model with false information which would set competitors back).
The release of Llama 405b was the thing that most succinctly explained this to me. At least when it comes to the current generation of cutting edge LLMs, there is no secret sauce. Llama 405b is a cutting edge model with, as far as I can tell, no advances in architecture or training compared to the development of GPT-3. Indeed, it appears in architecture substantially simpler than GPT-4 while outperforming it, suggesting that in the long-run, simplicity of architecture tends to win out, especially if you are willing to take a relatively small (<3x) compute-cost hit.
The architecture is a straightforward transformer with no mixture of experts or anything fancy:
The training process did nothing interesting. It used the most obvious implementation of supervised fine-tuning and reinforcement training.
The data cleaning process was somewhat more involved, and we know less about it, but I think it is unlikely to have done anything like synthetic data generation or complicated AI-assisted review.
This might all again change with the next generation of LLMs (especially with things like Strawberry, which looks like it might do something more interesting), but at least right now, I think almost any competent engineering team in the world could build a cutting-edge AI model, if they were just willing to spend the compute. It requires overcoming some minor engineering challenges, but the basics of how to do this are figured out. There is no moat.
Llama 405B was trained on a bunch of synthetic data in post-training for coding, long-context prompts, and tool use (see section 4.3 of the paper).
@ryan_greenblatt: Curious if you have a quick example of an architectural change from GPT-3. Quick googling/perplexing maybe suggests some changes in the attention algorithm (grouped-query attention instead of whatever GPT-3 was doing).
I was trying to just highlight “training” rather than architecture. I think there are architecture changes (SwiGLU, grouped-query attention, probably somewhat better tuned transformer hparams like layer count etc.) though these are perhaps minor.
My understanding of the key training advances relative to GPT-3:
Closer to Chinchilla optimal via having enough data. (I think 405b is 2x too much data according to Chinchilla while GPT-3 is 8x too little data.)
Better data. The paper says “Compared to prior versions of Llama (Touvron et al., 2023a,b), we improved both the quantity and quality of the data we use for pre-training and post-training.”
They did the Chinchilla scaling experiments themselves, it’s in the report (Section 3.2.1 Scaling Laws). The result claims that 40 tokens/parameter is actually optimal in their setup (2x more than in the Chinchilla paper), so Llama-3-405b is Chinchilla optimal in the relevant sense, it’s not trained on too much data. The result is slightly suspicious in that their largest datapoints are 1e22 FLOPs, while Llama-3-405b itself is 4e25 FLOPs, so that’s a lot of extrapolation. But overall they find that the optimal tokens/parameter ratio increases with compute, more so than in the Chinchilla paper, and Llama-3-405b had more compute than Chinchilla.
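As a rough cross-check of the tokens-per-parameter figure, one can invert the standard dense-transformer approximation C ≈ 6·N·D (compute ≈ 6 × parameters × tokens) using the 4e25 FLOPs and 405B parameters mentioned in this thread. The approximation and the helper name are assumptions of this sketch; the Llama report's own token count is not used here.

```python
def tokens_from_compute(total_flops: float, n_params: float) -> float:
    """Invert the C ~= 6 * N * D approximation to get training tokens D."""
    return total_flops / (6.0 * n_params)


llama_flops = 4e25     # training compute quoted in the thread
llama_params = 405e9   # parameter count of Llama-3-405b

tokens = tokens_from_compute(llama_flops, llama_params)
tokens_per_param = tokens / llama_params

print(f"implied training tokens: {tokens:.2e}")
print(f"implied tokens per parameter: {tokens_per_param:.0f}")
```

The implied ratio comes out at roughly 40 tokens per parameter, which matches the optimum the report claims and is about twice the 20 tokens per parameter usually associated with the Chinchilla paper.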
This is also consistent with the CARBS experiments done by Imbue (search for “tokens per parameter”):
Ah, sorry, yeah, I basically agree with this. I do think the scaling law stuff made a big difference. I commented a bit on the training data stuff, but my best guess is the changes there are also minor (besides the sheer volume).
Keep in mind that if, hypothetically, there were major compute efficiency tricks to be had, they would likely not be shared publicly. So the absence of publicly known techniques is not strong evidence in either direction.
Also, in general I start from a prior of being skeptical of papers claiming their models are comparable/better than GPT-4. It’s very easy to mislead with statistics—for example, human preference comparisons depend very heavily on the task distribution, and how discerning the raters are. I have not specifically looked deeply into Llama 405B though.
That’s true, though I do think there are various proxies that make at least the extreme end of this kind of thing for currently deployed models relatively easy to rule out (like the compute-purchase and allocation decisions of major cloud providers who host some of these models, and staff allocation and various other things).
I do think most organizations who claim parity with GPT-4 or Sonnet are almost always overstating things. My experience with 405b suggests it is also not at the level of Claude 3.5 Sonnet, but it does seem to be at the level of the original GPT-4, though I am not confident since I haven’t played around that much with GPT-4 recently.
Yeah, I mostly agree. I would say that there may or may not be certain secret techniques which will give models a slightly lower loss plateau for a given parameter count. That matters more to the large companies than compute efficiency, I think.
Accumulate enough loss-plateau-lowering tidbits, and it could add up to having the best model out of a group of similarly sized models.
Both of these seem false.
Re: talent, see from their website:
They don’t list their team on their site, but I know their early team includes Igor Babuschkin who has worked at OAI and DeepMind, and Christian Szegedy who has 250k+ citations including several foundational papers.
Re: resources, according to Elon’s early July tweet (ofc take Elon with a grain of salt) Grok 2 was trained on 24k H100s (approximately 3x the FLOP/s of GPT-4, according to SemiAnalysis). And xAI was working on a 100k H100 cluster that was on track to be finished in July. Also they raised $6B in May.
According to DCD, that should be fall 2025. Planned power is 150 megawatts or possibly 50+150 megawatts, which is good for 100K H100s, but not more than that. The request for the 150 megawatts is still being discussed by the utilities, as of August 2024. Any future Blackwells will need to go elsewhere, the whole plan for this datacenter seems to be the 100K H100s. (This costs about $5bn, and xAI only closed its $6bn Series B in May 2024.)
This scale seems to be available from AWS, and takes about a month to invest GPT-4 levels of compute. Grok-2 was probably rushed, once it was ready to train, in order to finally get a 4-level model, so it didn’t train for very long. If 100K H100s clusters remain impossible to access, and the full Memphis datacenter won’t get online at least for months yet (with significantly more H100s than 24K), it seems that the reasonable thing right now is to simply train on 24K H100s for more months. That’s probably going to be Grok-3.
Unless Elon is lying, it was operational as of July, though perhaps only with about 32k of the H100s rather than all of them. My understanding is that at least 64k are operational now.
Yes, though mobile generators are in use which could power at least a large fraction of the H100s. See discussion here.
Seems to be fully online as of now (Sep. 2) based on this tweet?
I now think this is false. From The Information:
Keep in mind Musk never said it was “fully online” or “100,000 GPUs are running concurrently” or anything like that. He only said that the cluster was “online”, which could mean just about anything, and that it is “the most powerful AI training system”, which is unfalsifiable (who can know how powerful every AI training system is worldwide, including all of the secret proprietary ones by FANG etc?) and obvious pure puffery (“best pizza in the world!”). If you fell for it, well, then the tweet was for you.
I wonder if it’s all running on generators, and what this means about Grok-3. With 30K H100s, 1.5 months only get 4e25 FLOPs, the Llama-3 compute. I’m guessing they’d want 1e26 FLOPs or so to get a meaningful improvement over Grok-2, which is 2 more months. But in 2 months, 100K H100s give 1.6e26 FLOPs (I’m assuming slightly worse utilization).
Maybe figuring out how to be efficient with including more compute into a run that has already started is part of the plan, so that in a few more months the mentioned scaleup to further 50K H100s and 50K H200s could happen mid-run for Grok-4? Sounds dubious.
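The Grok-3 reasoning above amounts to asking how many months a given cluster needs to reach a compute target. A minimal sketch of that inverse calculation follows; the 35% and 30% utilization values are assumptions chosen to match the "slightly worse utilization" remark, and the function name is hypothetical.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month
H100_PEAK = 1000e12                 # dense BF16 FLOP/s, as used in this thread


def months_to_reach(target_flops: float, n_gpus: int, utilization: float,
                    peak: float = H100_PEAK) -> float:
    """Months of training needed for a cluster to accumulate target_flops."""
    per_month = n_gpus * peak * utilization * SECONDS_PER_MONTH
    return target_flops / per_month


target = 1e26  # the "meaningful improvement over Grok-2" level guessed above

on_30k = months_to_reach(target, 30_000, 0.35)    # assumed 35% utilization
on_100k = months_to_reach(target, 100_000, 0.30)  # assumed 30% utilization

print(f"30K H100s:  {on_30k:.1f} months to reach {target:.0e} FLOPs")
print(f"100K H100s: {on_100k:.1f} months to reach {target:.0e} FLOPs")
```

On 30K H100s the target takes a bit over three and a half months, which is the 1.5 months plus roughly 2 more; on 100K H100s it takes well under two months.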
Memphis datacenter might be operational in some form, but the 100K H100s cluster is not operational, and I was responding to elifland’s specific claim about “a 100k H100 cluster that was on track to be finished in July”. The point is, the scale that’s beyond what you can get from AWS is not going to be available for some time. This is a point journalists repeatedly got wrong, what is claimed is that something is operational in July, and that the datacenter is planned to have 100K H100s, but it doesn’t follow that 100K H100s are operational in July.
By analogy with Llama-3-405b, Grok-2 started training no later than Mar-Apr 2024 (it needs to finish pre-training, and then go through RLHF), so it wasn’t trained using the Memphis datacenter. And in its current state, the Memphis datacenter won’t significantly improve on that scale, the bulk of the improvement would need to come from training for more months. If by the end of 2024, both 100K H100s and the 150 megawatts substation are ready, then xAI will start to catch up with OpenAI, which might already be training at that scale since May.
So Grok-3 is probably using these 30K H100s instead of rented compute like Grok-2. This seems to be a wash in terms of scale, more a way of keeping the 30K H100s in use and getting experience for the subsequent 100K run. Targeting end of 2024 for Grok-3 release means it finishes pre-training in late 2024, maybe Oct-Nov 2024 (leaving some time for RLHF until end of 2024), so this is some evidence for end of 2024 as the time when 100K H100s get online, otherwise Grok-3 could be trained for longer. As it is, it’s going to get about 1e26 FLOPs. Since Grok-1 was MoE (unlike Llama-3-405b), this has a chance of being better than current SOTA as of Aug 2024, but by the end of 2024 there might already be Claude 3.5 Opus or a new Gemini.
No. There isn’t much “secret sauce”, and these companies never had a large amount of AI talent to begin with. Their advantage is being in a position with hype/reputation/size to get to market faster. It takes several months to set up the infrastructure (getting money, data, and compute clusters), but that’s really the only hurdle.
No. “Everyone” in the AI research community knew how to build Llama, multi-modal models, or video diffusion models a year before they came out. They just didn’t have $10M to throw around.
Also, fine-tuning isn’t really the way to go. I can imagine people using it as a teacher during the warming up phase, but the coding infrastructure doesn’t really exist to fine-tune or integrate another model as part of a larger one. It’s usually easier to just spend the extra time securing money and training.
Yep. Even five years ago you could open a Colab notebook and train a language translation model in a couple of minutes.
No, images are much harder than language. With language models, you can exactly model the output distribution, while the space of images is continuous and much too large for that. Instead, the best models measure the probability flow (e.g. diffusion/normalizing flows/flow-matching), and follow it towards high-probability images. However, parts of images should be discrete. You know humans have five fingers, or text has words in it, but flows assume your probabilities are continuous.
Imagine you have a distribution that looks like
__|_|_|__
A flow will round out those spikes into something closer to
_/^\/^\/^\__
which is why gibberish text or four-and-a-half fingers appear. In video models, this leads to dogs spawning and disappearing into the pack.
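The spike-rounding picture can be mimicked with a toy calculation: take a distribution with mass on only three discrete outcomes and blur it with a Gaussian kernel. This is only an analogy for the smoothing effect described above, not how diffusion or flow-matching models are implemented; the kernel width and all names here are assumptions chosen for illustration.

```python
import math


def gaussian_kernel(radius: int, sigma: float) -> list:
    """Discrete Gaussian weights on offsets -radius..radius, normalized to sum to 1."""
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(p: list, kernel: list) -> list:
    """Spread each point's probability mass over its neighbours."""
    radius = len(kernel) // 2
    out = [0.0] * len(p)
    for i, mass in enumerate(p):
        for k, w in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < len(p):
                out[j] += mass * w
    return out


# Mass only on three discrete outcomes (the spikes in "__|_|_|__").
spiky = [0, 0, 1 / 3, 0, 1 / 3, 0, 1 / 3, 0, 0]
blurred = smooth(spiky, gaussian_kernel(radius=2, sigma=0.6))

for i, (before, after) in enumerate(zip(spiky, blurred)):
    print(f"x={i}: before={before:.3f} after={after:.3f} " + "#" * round(after * 40))
```

After blurring, the positions between the spikes carry nonzero probability, which is the toy counterpart of sampling "four and a half fingers".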
Partly when it comes to image/video models, but this isn’t a huge factor.
I think it’s because AI is a winner-takes-all competition. It’s extremely easy for customers to switch, so they all go to the best model. Since ClosedAI already has funding, compute, and infrastructure, it’s risky to compete against them unless you have a new kind of model (e.g. LiquidAI), reputation (e.g. Anthropic), or are a billionaire’s pet project (e.g. xAI).
This is not an answer to the broader question, but just regarding the “no Wikipedia page” thing.
I would like to write a Wikipedia page about Flux, but as it is, there is very little quality information about it. We have a lot of anecdotal information about how to use it, and a little academic description of it, but that’s not enough.
Besides, it seems everyone who can write well in artificial intelligence wants to write their damned academic blog that is read by like 10 people a month and not Wikipedia, and Wikipedia accumulates a large amount of badly written stuff by amateurs.
As an example, see this page
https://en.wikipedia.org/wiki/Generative_adversarial_network
The “Applications” section is a typical example of how stupid and badly formatted it is. Everything above it I wrote myself. Everything below it I only did a light amount of editing. Before I went in to write all of that in 2022-07 (2022! Imagine that! GANs were famous since about 2018 and it waited until 2022 to get a decent Wikipedia page?), the entire page was crap like it: https://en.wikipedia.org/w/index.php?title=Generative_adversarial_network&oldid=1096565363
Similarly for the Transformer. https://en.wikipedia.org/w/index.php?title=Transformer_(deep_learning_architecture)&oldid=1095579622 I have only recently finished writing it. https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture) and then I tried applying for “Good Article” status, and got promptly rejected for not putting enough inline citations (do they really want me to put inline citations everywhere even if that means I just have to refer to the Attention is All You Need paper 30 times?) and too much primary literature and too many arXiv links (not a peer-reviewed source).
The RNN page was also terrible https://en.wikipedia.org/w/index.php?title=Recurrent_neural_network&oldid=1214097285 until I cleaned it up. There is still a large amount of crud but I put all of it in the lower half of the page, so that people know when to stop reading. I put it there just in case some annoyed editor reverts my edit for deleting their favorite section, and in case there is something valuable there (that I can’t be bothered to figure out, because of how badly written it is).
The list of crud goes on and on. The Convolutional Neural Network page is still absolutely terrible. It has a negative amount of value, and I’m too tired to clean it up.
Sometimes there’s an important model that’s entirely neglected. Like the T5 model series. https://en.wikipedia.org/wiki/T5_(language_model) Why this model had to wait until me in 2024 to finally write it, I have no idea.
P.S.: The damned Transformer page gets someone (always a different one) writing in some Schmidhuber-propaganda. I remove it once a month. Why there are so many fans of Schmidhuber, I have no idea.
Without a doubt, the question is very interesting. As it stands, it looks like something doesn’t fit, and it would be interesting to look at it from a different angle. What’s more, it’s not a race to be the first to AGI. It’s possible that what’s happening is that the costs of training the new models now in the oven are too high. The investors are thrilled to be able to say that they were the first to reach their goal. But don’t be fooled; their main job is to make sure they get back everything they put in. If we put all of these expected costs into one equation, it’s clear that the return has to be large in the short and medium term for it to be even a moderately good investment. The truth is that the Top 3’s sales of these models today are very low. From this point of view, all of the big companies mentioned in the article should be working hard to find a way to recover their investments.