Phi-4: Synthetic data works. Pretraining’s days are numbered.
Microsoft just announced Phi-4, a 14B parameter model that matches GPT-4o on some difficult benchmarks. The accompanying technical report offers a glimpse into the growing importance of synthetic data and how frontier model training is changing.
Some takeaways:
The data wall is looking flimsier by the day. Phi-4 is highly capable not despite but because of synthetic data. It was trained on a curriculum of 50 types of synthetic datasets, generated by GPT-4o from a diverse set of organic data “seeds”. We’re seeing a smooth progression from training on (1) organic data, to (2) human-curated datasets, to (3) AI-curated datasets (filtering for appropriate difficulty, using verifiers), to (4) AI-augmented data (generating Q&A pairs, iteratively refining answers, reverse-engineering instructions from code, etc.), to (5) pure synthetic data.
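To make the later stages concrete, here is a rough sketch of what an AI-augmented, verifier-filtered generation loop could look like. The helper names are hypothetical placeholders, not anything from the Phi-4 report:

```python
# Rough sketch of an AI-augmented + AI-curated data loop (stages 3-4 above).
# All helper names are hypothetical placeholders, not the Phi-4 pipeline.

from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str
    difficulty: float  # teacher's estimate of how hard the question is

def teacher_generate(seed_doc: str, n: int) -> list[QAPair]:
    """AI augmentation: ask a teacher model (e.g. GPT-4o) for n Q&A pairs grounded in the seed."""
    raise NotImplementedError  # stand-in for an API call

def verifier_accepts(pair: QAPair) -> bool:
    """AI curation: check the answer, e.g. by running code, re-deriving it, or majority vote."""
    raise NotImplementedError  # stand-in for a verifier

def build_synthetic_dataset(seeds: list[str],
                            per_seed: int = 4,
                            min_difficulty: float = 0.3,
                            max_difficulty: float = 0.9) -> list[QAPair]:
    dataset = []
    for seed in seeds:
        for pair in teacher_generate(seed, per_seed):
            # keep only questions in a useful difficulty band...
            if not (min_difficulty <= pair.difficulty <= max_difficulty):
                continue
            # ...and only answers a verifier signs off on
            if verifier_accepts(pair):
                dataset.append(pair)
    return dataset
```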
Training is fracturing. It’s not just the quality and mixture but also the ordering of data that matters. Phi-4 features a “midtraining” phase that expands its context length from 4k to 16k tokens, upweighting long-context behavior only once the model has become capable enough to integrate that extra information. Post-training features a standard SFT phase and two rounds of DPO: one using “pivotal token search” to generate minimally distinct pairs that are easier to learn from, and one more standard “judge-guided” round. In the authors’ own words: “An end-to-end optimization of pretraining data mixture that also takes into account the effects of post-training is an interesting future area of investigation.”
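For reference, a minimal sketch of the generic DPO objective those rounds build on (this is standard DPO, not the Phi-4 team’s exact implementation; as I understand it, pivotal token search mainly changes how the chosen/rejected pairs are constructed rather than the loss itself):

```python
# Minimal sketch of the standard DPO loss. Inputs are summed token log-probs of
# the chosen/rejected responses under the policy being trained and under a
# frozen reference model.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # how much more the policy prefers each response than the reference does
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # push the chosen margin above the rejected margin
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```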
The next frontier is self-improvement. Phi-4 was taught by GPT-4; GPT-5 is being taught by o1; GPT-6 will teach itself. This progression towards online learning is possible because of amortization: additional inference-time compute spent generating higher quality tokens becomes training data. The techniques range from simple (rejection-sampling multiple answers and iterative refinement) to complex (o1-style reasoning), but the principle remains: AI systems will increasingly be involved in training their successors and then themselves by curating, enhancing, and generating data, and soon by optimizing their own training curricula.
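A minimal sketch of that amortization loop, assuming a hypothetical sampler and verifier (this is generic rejection sampling, not any particular lab’s pipeline): spend k samples’ worth of inference on each question, keep only verified answers, and the survivors become training data.

```python
# Toy sketch of amortization via rejection sampling. sample_answer and
# is_correct are hypothetical stand-ins (rollouts from the current model, and
# unit tests / a checker / an LLM judge, respectively).

def sample_answer(model, question: str) -> str:
    raise NotImplementedError  # one (possibly expensive) reasoning rollout

def is_correct(question: str, answer: str) -> bool:
    raise NotImplementedError  # verification step

def rejection_sample_dataset(model, questions: list[str], k: int = 16) -> list[tuple[str, str]]:
    data = []
    for q in questions:
        candidates = [sample_answer(model, q) for _ in range(k)]  # k x inference compute spent...
        verified = [a for a in candidates if is_correct(q, a)]
        if verified:
            data.append((q, verified[0]))  # ...amortized into one training example
    return data
```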
The implication: If you don’t have access to a 2024-frontier AI, you’re going to have a hard time training the next frontier model. That gap will likely widen with each subsequent iteration.
I don’t think Phi-4 offers convincing evidence either way. You can push performance on verifiable tasks quite far without the model becoming generally more capable. AlphaZero’s success doesn’t imply that scaling its methods leads to general superintelligence, and the same goes for Phi-4.
In contrast, using o1-like training as a way to better access ground truth in less tractable domains seems more promising, since by some accounts its tactics on long reasoning traces work even in non-technical domains (unlike for DeepSeek R1), possibly because they are emergent rather than directly encouraged with task-specific training.
>Phi-4 is highly capable not despite but because of synthetic data.
Imitation models tend to be quite brittle outside of their narrowly imitated domain, and I suspect the same to be the case for Phi-4. Some of the decontamination measures they took provide a bit of counter-evidence to this, but not much. I’d update more strongly if I saw results on benchmarks that capture the generality and diversity of tasks required for meaningful autonomous cognitive labour “in the wild”, such as SWE-Bench (or rather what I understand SWE-Bench to be; I have yet to look at it very closely).
>Phi-4 was taught by GPT-4; GPT-5 is being taught by o1; GPT-6 will teach itself.
There’s an important distinction between using synthetic data in teacher-student setups and using it for self-teaching. While synthetic data is a demonstrably powerful way of augmenting human feedback, my current estimation is that the usual mode-collapse arguments still hold for purely synthetic, self-generated datasets, and that Phi-4 doesn’t provide counter-evidence against them.
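For intuition, here is the mode-collapse mechanism in miniature: a toy simulation (obviously not a claim about Phi-4 itself) in which a one-dimensional “model” is refit on its own samples each generation, and the fitted spread tends to decay toward zero.

```python
# Toy illustration of the mode-collapse worry: repeatedly refit a trivial
# "model" on purely synthetic data sampled from itself. Over enough
# generations the fitted spread tends to collapse toward zero.

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0          # generation-0 "model": a unit Gaussian
samples_per_generation = 20   # kept small so the effect shows up quickly

for generation in range(1, 301):
    synthetic = rng.normal(mu, sigma, size=samples_per_generation)
    mu, sigma = synthetic.mean(), synthetic.std()  # retrain purely on synthetic data
    if generation % 50 == 0:
        print(f"generation {generation:3d}: sigma = {sigma:.4f}")
```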
>The implication: If you don’t have access to a 2024-frontier AI, you’re going to have a hard time training the next frontier model. That gap will likely widen with each subsequent iteration.
This doesn’t seem super clear to me. Without synthetic data, you need to scrape large parts of the web and manage a lot of storage infrastructure. That can be done either illegally or through complex negotiations (especially as companies are catching on to this).
In comparison, it would be very useful if you could train a near-SOTA model with just synthetic data, say from an open-source model. This might not bring you all the way to SOTA, but close might be good enough for many things.
I agree. My original wording was too restrictive, so let me try again:
I think pushing the frontier past 2024 levels is going to require more and more input from the previous generation’s LLMs. These could be open- or closed-source (the closed-source ones will probably continue to be better), but the bottleneck is likely to shift from “scraping and storing lots of data” to “running lots of inference to generate high-quality tokens.” This will change the balance to be easier for some players, harder for others. I don’t think that change in balance is perfectly aligned with frontier labs.
One tiny point: I think the phrase “synthetic data” arguably breaks down at some point. “Synthetic data” sounds to me like “we’re generating fake data made to come from a similar distribution to ‘real data’.” But I assume that a lot of the data we’ll get with inference will be more like straightforward reasoning.
For example, we get o1 to solve a bunch of not-yet-recorded mathematical lemmas, then train the next model on those. Technically this is “synthetic data”, but I don’t see why this data is fundamentally different from similar mathematics that humans do. This data is typically the synthesis or distillation of much longer search and reasoning processes.
As such, it seems very sensible to me to expect “synthetic data” to be a major deal.
>For example, we get o1 to solve a bunch of not-yet-recorded mathematical lemmas, then train the next model on those.
Would there have to be human vetting to check that o1’s solutions are correct? The practicality of that would depend on the scale, but you don’t want to end up with a blurry JPEG of a blurry JPEG of the internet.
For mathematical lemmas, you can formalize them in a language like Lean to automatically check correctness, so access to ground truth is even clearer than for programming. The main issue is probably finding a large supply of sensible formalized statements that the system is actually capable of proving.
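As a toy illustration of what that machine-checkable ground truth looks like (an illustrative example, not something from the report): if Lean’s kernel accepts the proof, the statement is true, with no human vetting of the answer required.

```lean
-- Toy example of machine-checkable ground truth: if the Lean kernel accepts
-- this proof, the lemma is true; no human vetting of the answer is needed.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```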
Another interesting takeaway for me: I didn’t realize that Microsoft was doing much training of its own. It makes a lot of sense that they’d want their own teams making their own models, in part to hedge around OpenAI.
I’m curious what their strategy will be in the next few years.
It looks like recursive self-improvement is here for the base case, at least. It will be interesting to see if anyone uses solely Phi-4 to pretrain a more capable model.