Is this apparent parity due to a mass exodus of employees from OpenAI, Anthropic, and Google to other companies, resulting in the diffusion of “secret sauce” ideas across the industry?
No. There isn’t much “secret sauce”, and these companies never had a huge amount of AI talent to begin with. Their advantage is having the hype, reputation, and size to get to market faster. It takes several months to set up the infrastructure (raising money, gathering data, and standing up compute clusters), but that’s really the only hurdle.
Does this parity exist because other companies are simply piggybacking on Meta’s open-source AI model, which was made possible by Meta’s massive compute resources? Now, by fine-tuning this model, can other companies quickly create models comparable to the best?
No. “Everyone” in the AI research community knew how to build Llama, multi-modal models, or video diffusion models a year before they came out. They just didn’t have $10M to throw around.
Also, fine-tuning isn’t really the way to go. I can imagine people using an open model as a teacher during the warm-up phase (a generic sketch of that follows below), but the coding infrastructure doesn’t really exist to fine-tune or integrate someone else’s model as part of a larger one. It’s usually easier to just spend the extra time securing money and training from scratch.
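For concreteness, this is roughly what “using it as a teacher” means. The snippet below is generic knowledge distillation, my own hedged sketch rather than any lab’s actual pipeline: early in training, the student is nudged to match a frozen teacher’s output distribution on top of the usual cross-entropy loss. The blend weight and temperature are arbitrary illustrative values.

```python
# Hedged sketch of a distillation-style warm-up loss (illustrative only;
# not any specific lab's pipeline). The teacher logits would come from a
# frozen open model; here they are random placeholders.
import torch
import torch.nn.functional as F

def warmup_loss(student_logits, teacher_logits, targets, alpha=0.5, T=2.0):
    """Blend cross-entropy on real targets with KL towards a frozen teacher."""
    ce = F.cross_entropy(student_logits, targets)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),   # student distribution
        F.softmax(teacher_logits / T, dim=-1),       # teacher distribution
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1 - alpha) * kl

# Toy usage: a batch of 4 positions over a 10-token vocabulary.
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)            # stand-in for the frozen teacher's logits
targets = torch.randint(0, 10, (4,))
print(warmup_loss(student, teacher, targets))
```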
Is it plausible that once LLMs were validated and the core idea spread, it became surprisingly simple to build, allowing any company to quickly reach the frontier?
Yep. Even five years ago you could open a Colab notebook and train a language translation model in a couple of minutes.
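To give a sense of how low the barrier was even then, here is a self-contained toy sketch in that spirit (my own example: a character-level seq2seq memorizing three phrase pairs, nothing like a real translation system) that trains in seconds on a CPU:

```python
# Toy character-level "translation" model: an encoder-decoder GRU trained
# on three hard-coded phrase pairs. Purely illustrative of how small the
# barrier to entry is, not a production recipe.
import torch
import torch.nn as nn

pairs = [("hello", "bonjour"), ("cat", "chat"), ("thank you", "merci")]
chars = sorted({c for src, tgt in pairs for c in src + tgt}) + ["<s>", "</s>"]
stoi = {c: i for i, c in enumerate(chars)}

def encode(s):
    return torch.tensor([stoi[c] for c in s])

class TinySeq2Seq(nn.Module):
    def __init__(self, vocab, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, hidden)
        self.enc = nn.GRU(hidden, hidden, batch_first=True)
        self.dec = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, src, dec_in):
        _, h = self.enc(self.emb(src).unsqueeze(0))        # encode source chars
        y, _ = self.dec(self.emb(dec_in).unsqueeze(0), h)  # teacher-forced decode
        return self.out(y).squeeze(0)

model = TinySeq2Seq(len(chars))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for step in range(300):                                    # finishes in seconds on CPU
    for src, tgt in pairs:
        dec_in = torch.tensor([stoi["<s>"]] + [stoi[c] for c in tgt])
        dec_out = torch.tensor([stoi[c] for c in tgt] + [stoi["</s>"]])
        loss = loss_fn(model(encode(src), dec_in), dec_out)
        opt.zero_grad()
        loss.backward()
        opt.step()
```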
Are AI image generators just really simple to develop but lack substantial economic reward, leading large companies to invest minimal resources into them?
No, images are much harder than language. A language model’s output is a finite vocabulary, so you can model the next-token distribution exactly; the space of images is continuous and far too large for that. Instead, the best models estimate a probability flow (e.g. diffusion, normalizing flows, flow matching) and follow it towards high-probability images. The catch is that parts of images really should be discrete: humans have five fingers, text is made of actual words, but flows assume the density is continuous.
Imagine you have a distribution that looks like
__|_|_|__
A flow will round out those spikes into something closer to
_/^\/^\/^\__
which is why gibberish text or four-and-a-half fingers appear. In video models, this leads to dogs spawning and disappearing into the pack.
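You can see the same smoothing effect numerically with any continuous density estimator. The sketch below uses a Gaussian KDE as a stand-in for a flow (my illustration, not how image models are actually built): the data only ever takes the values 1, 2, or 3, yet the continuous fit puts real probability mass on impossible in-between values.

```python
# Tiny numerical illustration of continuous models smearing discrete spikes.
# A Gaussian KDE stands in for a flow/diffusion model here.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
samples = rng.choice([1.0, 2.0, 3.0], size=10_000)   # discrete spikes: __|_|_|__
density = gaussian_kde(samples)                      # continuous fit:   _/^\/^\/^\__

print(density(2.0))   # high, as expected
print(density(2.5))   # not zero: the "four-and-a-half fingers" mass
```

That leaked mass between the spikes is the four-and-a-half-fingers problem in miniature.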
Could it be that legal challenges in building AI are so significant that big companies are hesitant to fully invest, making it appear as if smaller companies are outperforming them?
Partly when it comes to image/video models, but this isn’t a huge factor.
And finally, why is OpenAI so valuable if it’s apparently so easy for other companies to build comparable tech? Conversely, why aren’t the no-name companies making leading LLMs valued higher?
I think it’s because AI is a winner-takes-all competition. It’s extremely easy for customers to switch, so they all go to the best model. Since ClosedAI already has funding, compute, and infrastructure, it’s risky to compete against them unless you have a new kind of model (e.g. LiquidAI), reputation (e.g. Anthropic), or are a billionaire’s pet project (e.g. xAI).