>The implication: If you don’t have access to a 2024-frontier AI, you’re going to have a hard time training the next frontier model. That gap will likely widen with each subsequent iteration.
This doesn’t seem super clear to me. Without synthetic data, you need to scrape large parts of the web and manage a lot of storage infrastructure. That can be done either illegally or through complex negotiations (especially as companies catch on to this).
In comparison, it would be very useful if you could train a near-SOTA model with just synthetic data, say from an open-source model. This might not bring you all the way to SOTA, but close might be good enough for many things.
I agree. My original wording was too restrictive, so let me try again:
I think pushing the frontier past 2024 levels is going to require more and more input from the previous generation’s LLMs. These could be open- or closed-source (the closed-source ones will probably continue to be better), but the bottleneck is likely to shift from “scraping and storing lots of data” to “running lots of inference to generate high-quality tokens.” This will change the balance to be easier for some players, harder for others. I don’t think that change in balance is perfectly aligned with frontier labs.
One tiny point: I think the phrase “synthetic data” arguably breaks down at some point. “Synthetic data” sounds to me like “we’re generating fake data meant to look like it came from the same distribution as ‘real data’.” But I assume a lot of the data we’ll get from inference will be more like straightforward reasoning.
For example, we get O1 to solve a bunch of not-yet-recorded mathematical lemmas, then train the next model on those. Technically this is “synthetic data”, but I don’t see why this data is fundamentally different from similar mathematics that humans do. This data is typically the synthesis or distillation of much longer search and reasoning processes.
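To make that concrete, here’s a rough sketch of the loop I have in mind (purely illustrative; `solve_lemma` is a hypothetical placeholder, not any particular model’s real API):

```python
import json

def solve_lemma(model: str, statement: str) -> str:
    """Hypothetical placeholder: ask the previous-generation model for a full worked solution."""
    raise NotImplementedError  # would be a call to whatever inference API you have access to

def distill(model: str, statements: list[str], out_path: str = "lemma_solutions.jsonl") -> None:
    """Write (lemma, solution) pairs to use as training data for the next model."""
    with open(out_path, "w") as f:
        for statement in statements:
            solution = solve_lemma(model, statement)
            # Only the final worked solution is stored, i.e. the distillate
            # of a much longer search/reasoning process.
            f.write(json.dumps({"prompt": statement, "completion": solution}) + "\n")
```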
As such, it seems very sensible to me to expect “synthetic data” to be a major deal.
>For example, we get O1 to solve a bunch of not-yet-recorded mathematical lemmas, then train the next model on those.
Would there have to be human vetting to check that O1’s solutions are correct? The practicality of that would depend on the scale, but you don’t want to end up with a blurry JPEG of a blurry JPEG of the internet.
For mathematical lemmas, you can formalize them in a language like Lean and check correctness automatically. So access to ground truth is even clearer than for programming; the main issue is probably finding a large supply of sensible formalized statements that the system is actually capable of proving.
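To make the “ground truth” point concrete, here’s a toy Lean 4 example (my own illustration, nothing from the thread): if the compiler accepts the file, the proof is correct by construction, so a model-generated proof either type-checks or gets discarded, with no human vetting of the proof itself.

```lean
-- If this file compiles, the proof is machine-checked end to end;
-- a wrong or hand-wavy proof attempt simply fails to type-check
-- and can be filtered out automatically.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

The harder part, as noted above, is producing a large supply of formalized statements worth proving in the first place.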
Another interesting takeaway for me: I didn’t realize that Microsoft was doing much training of its own. It makes a lot of sense that they’d want their own teams making their own models, in part to hedge around OpenAI.
I’m curious what their strategy will be in the next few years.
This is neat, thanks for highlighting.