Executive director at Timaeus. Working on singular learning theory and developmental interpretability.
Website: jessehoogland.com
Twitter: @jesse_hoogland
Implications of DeepSeek-R1: Yesterday, DeepSeek released a paper on their o1 alternative, R1. A few implications stood out to me:
Reasoning is easy. A few weeks ago, I described several hypotheses for how o1 works. R1 suggests the answer might be the simplest possible approach: guess & check (sketched at the end of this post). No need for fancy process reward models, no need for MCTS.
Small models, big think. A distilled 7B-parameter version of R1 beats GPT-4o and Claude 3.5 Sonnet (new) on several hard math benchmarks. There appears to be a large parameter overhang.
Proliferation by default. There’s an implicit assumption in many AI safety/governance proposals that AGI development will be naturally constrained to only a few actors because of compute requirements. Instead, we seem to be headed to a world where:
Advanced capabilities can be squeezed into small, efficient models that can run on commodity hardware.
Proliferation is not bottlenecked by infrastructure.
Regulatory control through hardware restriction becomes much less viable.
For now, training still needs industrial compute. But it’s looking increasingly like we won’t be able to contain what comes after.
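To make "guess & check" concrete, here is a minimal sketch of the recipe in its simplest form: sample candidate solutions, score them with an automated verifier, and fine-tune on the ones that pass. (R1's actual recipe uses a policy-gradient method, GRPO, with rule-based rewards; the sketch below is the simpler rejection-sampling flavor of the same idea, and the function names are placeholders.)

```python
# Minimal "guess & check" loop: sample k candidate answers per prompt, keep the
# ones an automated verifier accepts, and fine-tune on them (expert-iteration /
# rejection-sampling style). `generate`, `verify`, and `finetune` are
# hypothetical stand-ins for the model's sampler, a ground-truth checker
# (unit tests, an exact-match math grader), and a training step.
from typing import Callable, List, Tuple

def guess_and_check_round(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],          # prompt, k -> k sampled answers
    verify: Callable[[str, str], bool],                  # (prompt, answer) -> passed?
    finetune: Callable[[List[Tuple[str, str]]], None],   # train on accepted (prompt, answer) pairs
    k: int = 16,
) -> float:
    """Run one round and return the fraction of prompts with at least one verified answer."""
    accepted: List[Tuple[str, str]] = []
    solved = 0
    for prompt in prompts:
        candidates = generate(prompt, k)                      # guess
        good = [a for a in candidates if verify(prompt, a)]   # check
        if good:
            solved += 1
            accepted.extend((prompt, a) for a in good)
    if accepted:
        finetune(accepted)                                    # reinforce what passed
    return solved / len(prompts)
```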
This is a research direction that dates back to Clift et al. 2021. For a more recent and introductory example, see this post by @Daniel Murfet.
(Note: I’ve edited the announcement to remove explicit mention of geometry of program synthesis.)
I want to point out that there are many interesting symmetries that are non-global or data-dependent. These “non-generic” symmetries can change throughout training. Let me provide a few examples.
ReLU networks. Consider the computation involved in a single layer of a ReLU network:
$$\mathbf{y} = W_2\,\mathrm{ReLU}(W_1\mathbf{x} + \mathbf{b}) + \mathbf{c},$$
or, equivalently, in index notation,
$$y_i = \sum_j (W_2)_{ij}\,\mathrm{ReLU}\!\Big(\sum_k (W_1)_{jk}\,x_k + b_j\Big) + c_i.$$
(Maybe we’re looking at a two-layer network where $\mathbf{x}$ are the inputs and $\mathbf{y}$ are the outputs, or maybe we’re at some intermediate layer where these variables represent internal activations before and after a given layer.)
Dead neuron $j$. If one of the biases $b_j$ is negative enough that it always dominates the associated preactivation $(W_1\mathbf{x})_j$ (i.e., $(W_1\mathbf{x})_j + b_j < 0$ for every input), then the ReLU will always spit out a zero at that index. This “dead” neuron introduces a new continuous symmetry: you can set the entries of column $j$ of $W_2$ to arbitrary values without affecting the network’s computation (that column only ever multiplies an activation that is exactly zero).
Bypassed neuron $j$. Consider the opposite: if $(W_1\mathbf{x})_j + b_j > 0$ for all possible inputs $\mathbf{x}$, then neuron $j$ will always activate, and the ReLU’s nonlinearity effectively vanishes at that index. This introduces a new continuous symmetry, where you can insert an arbitrary invertible transformation on the subspace of bypassed neurons, between the activations and the final transformation. For the sake of clarity, assume all neurons are bypassed; then
$$\mathbf{y} = W_2(W_1\mathbf{x} + \mathbf{b}) + \mathbf{c} = (W_2 A^{-1})\big((A W_1)\mathbf{x} + A\mathbf{b}\big) + \mathbf{c}$$
for any invertible matrix $A$ that keeps the transformed preactivations positive (so the bypass condition continues to hold).
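Here is a minimal numpy sketch that checks both symmetries numerically. The layer sizes, the extreme biases used to force dead/bypassed neurons, and the particular shear transformation are all arbitrary illustrative choices:

```python
# Quick numerical check of both symmetries: force neuron 0 to be dead and
# neurons 1 and 2 to be bypassed by choosing extreme biases on bounded inputs,
# then verify that the claimed weight changes leave the layer's output untouched.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out, n = 4, 3, 2, 1000

W1 = rng.normal(size=(d_hidden, d_in))
b = rng.normal(size=d_hidden)
W2 = rng.normal(size=(d_out, d_hidden))
c = rng.normal(size=d_out)
X = rng.uniform(-1, 1, size=(n, d_in))   # bounded inputs keep preactivations small

b[0] = -100.0   # neuron 0: preactivation + bias always < 0  -> dead
b[1] = +100.0   # neurons 1, 2: preactivation + bias always > 0 -> bypassed
b[2] = +100.0

def layer(W1, b, W2, c, X):
    return np.maximum(W1 @ X.T + b[:, None], 0.0).T @ W2.T + c

y = layer(W1, b, W2, c, X)

# Dead-neuron symmetry: column 0 of W2 can be anything.
W2_dead = W2.copy()
W2_dead[:, 0] = rng.normal(size=d_out)
assert np.allclose(y, layer(W1, b, W2_dead, c, X))

# Bypassed-neuron symmetry: apply an invertible shear A on the bypassed
# subspace (neurons 1 and 2), i.e. W1 -> A W1, b -> A b, W2 -> W2 A^{-1}.
eps = 0.1
W1_byp, b_byp, W2_byp = W1.copy(), b.copy(), W2.copy()
W1_byp[2] += eps * W1_byp[1]        # A: row 2 <- row 2 + eps * row 1
b_byp[2] += eps * b_byp[1]
W2_byp[:, 1] -= eps * W2_byp[:, 2]  # A^{-1}: col 1 <- col 1 - eps * col 2
assert np.allclose(y, layer(W1_byp, b_byp, W2_byp, c, X))

print("both symmetries hold on this sample")
```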
Hidden polytopes. A ReLU network learns a piecewise linear approximation to a function. For ease, consider the case of learning a 1-dimensional mapping. It might look something like this:
The vertices between polytopes correspond to a set of constraints on the weights. Consider what happens when two neighboring linear pieces line up (left to right). One vertex becomes redundant (dotted line). You can now move the vertex along the shared polytope without changing the function implemented. This corresponds to a continuous transformation of your weights in some direction of weight space. Importantly, this is only true locally: as soon as the vertex reaches the next edge of the shared polytope, pushing it any further will change the function. Moving the vertex in any direction orthogonal to the polytope will also change the function.
You might enjoy this new blogpost from HuggingFace, which goes into more detail.
I agree. My original wording was too restrictive, so let me try again:
I think pushing the frontier past 2024 levels is going to require more and more input from the previous generation’s LLMs. These could be open- or closed-source (the closed-source ones will probably continue to be better), but the bottleneck is likely to shift from “scraping and storing lots of data” to “running lots of inference to generate high-quality tokens.” This will change the balance to be easier for some players, harder for others. I don’t think that change in balance is perfectly aligned with frontier labs.
Phi-4: Synthetic data works. Pretraining’s days are numbered.
Microsoft just announced Phi-4, a 14B parameter model that matches GPT-4o on some difficult benchmarks. The accompanying technical report offers a glimpse into the growing importance of synthetic data and how frontier model training is changing.
Some takeaways:
The data wall is looking flimsier by the day. Phi-4 is highly capable not despite but because of synthetic data. It was trained on a curriculum of 50 types of synthetic datasets, generated by GPT-4o from a diverse set of organic data “seeds”. We’re seeing a smooth progression from training on (1) organic data, to (2) human-curated datasets, to (3) AI-curated datasets (filtering for appropriate difficulty, using verifiers), to (4) AI-augmented data (generating Q&A pairs, iteratively refining answers, reverse-engineering instructions from code, etc.), to (5) pure synthetic data.
Training is fracturing. It’s not just the quality and mixture but also the ordering of data that matters. Phi-4 features a “midtraining” phase that expands its context length from 4k to 16k tokens, upweighting long-context behavior only when the model has become capable enough to integrate that extra information. Post-training features a standard SFT phase and two rounds of DPO: one round of DPO using “pivotal token search” to generate minimally distinct pairs that are easier to learn from, and one round of more standard “judge-guided DPO” (the generic DPO objective is sketched at the end of this post). In the authors’ own words: “An end-to-end optimization of pretraining data mixture that also takes into account the effects of post-training is an interesting future area of investigation.”
The next frontier is self-improvement. Phi-4 was taught by GPT-4; GPT-5 is being taught by o1; GPT-6 will teach itself. This progression towards online learning is possible because of amortization: additional inference-time compute spent generating higher quality tokens becomes training data. The techniques range from simple (rejection-sampling multiple answers and iterative refinement) to complex (o1-style reasoning), but the principle remains: AI systems will increasingly be involved in training their successors and then themselves by curating, enhancing, and generating data, and soon by optimizing their own training curricula.
The implication: If you don’t have access to a 2024-frontier AI, you’re going to have a hard time training the next frontier model. That gap will likely widen with each subsequent iteration.
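For reference, here is the generic DPO objective in a few lines. This is the standard formulation (Rafailov et al., 2023), not Phi-4’s pivotal-token variant, and the inputs are assumed to be summed per-sequence log-probabilities:

```python
# Standard DPO loss for a single (chosen, rejected) pair: increase the policy's
# margin on the chosen response relative to a frozen reference model, decrease
# it on the rejected one. beta controls how far the policy may drift from the
# reference.
import math

def dpo_loss(
    logp_chosen: float, logp_rejected: float,           # log pi_theta(y_w|x), log pi_theta(y_l|x)
    ref_logp_chosen: float, ref_logp_rejected: float,   # same quantities under pi_ref
    beta: float = 0.1,
) -> float:
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    return math.log1p(math.exp(-logits))                 # equals -log(sigmoid(logits))

# Example: the policy prefers the chosen answer more strongly than the
# reference does, so the loss falls below log(2) ~= 0.693 (the zero-margin value).
print(dpo_loss(-10.0, -14.0, -12.0, -13.0))              # ~0.55
```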
The RL setup itself is straightforward, right? An MDP where S is the space of strings, A is the set of strings < n tokens, P(s’|s,a)=append(s,a) and reward is given to states with a stop token based on some ground truth verifier like unit tests or formal verification.
I agree that this is the most straightforward interpretation, but OpenAI have made no commitment to sticking to honest and straightforward interpretations. So I don’t think the RL setup is actually that straightforward.
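For concreteness, here is what that “straightforward” MDP looks like written out. The stop token, the crude token count, and the verifier are placeholders; this is the straightforward interpretation, not a claim about OpenAI’s actual setup:

```python
# Sketch of the quoted MDP: states are strings, actions append up to n tokens,
# transitions are deterministic, and reward comes from a ground-truth verifier
# once a stop token appears. `verify` is a hypothetical checker (e.g. unit tests).
from dataclasses import dataclass
from typing import Callable, Tuple

STOP = "<|stop|>"

@dataclass
class StringMDP:
    verify: Callable[[str], bool]   # ground-truth verifier on terminal states
    max_action_tokens: int = 128    # actions are strings of < n tokens

    def step(self, state: str, action: str) -> Tuple[str, float, bool]:
        assert len(action.split()) < self.max_action_tokens  # crude token count
        next_state = state + action                 # P(s'|s,a): deterministic append
        done = STOP in next_state
        reward = float(self.verify(next_state)) if done else 0.0
        return next_state, reward, done

# Usage: roll out a policy (e.g. an LLM) by repeatedly calling step() until done,
# then use the terminal reward as the RL signal.
```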
If you want more technical detail, I recommend watching the Rush & Ritter talk (see also slides and bibliography). This post was meant as a high-level overview of the different compatible interpretations with some pointers to further reading/watching.
The examples they provide in one of the announcement blog posts (under the “Chain of Thought” section) suggest this is more than just marketing hype (even if these examples are cherry-picked):
Here are some excerpts from two of the eight examples:
Cipher:
Hmm.
But actually in the problem it says the example:...
Option 2: Try mapping as per an assigned code: perhaps columns of letters?
Alternatively, perhaps the cipher is more complex.
Alternatively, notice that “oyfjdnisdr” has 10 letters and “Think” has 5 letters....
Alternatively, perhaps subtract: 25 −15 = 10.
No.
Alternatively, perhaps combine the numbers in some way.
Alternatively, think about their positions in the alphabet.
Alternatively, perhaps the letters are encrypted via a code.
Alternatively, perhaps if we overlay the word ‘Think’ over the cipher pairs ‘oy’, ‘fj’, etc., the cipher is formed by substituting each plaintext letter with two letters.
Alternatively, perhaps consider the ‘original’ letters.
Science:
Wait, perhaps more accurate to find Kb for F^− and compare it to Ka for NH4+.
...
But maybe not necessary.
...
Wait, but in our case, the weak acid and weak base have the same concentration, because NH4F dissociates into equal amounts of NH4^+ and F^-
...
Wait, the correct formula is:
It’s worth noting that there are also hybrid approaches, for example, where you use automated verifiers (or a combination of automated verifiers and supervised labels) to train a process reward model that you then train your reasoning model against.
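One way to get process labels out of an outcome-level verifier is to estimate, for each intermediate step, how often rollouts from that step end up verified (in the spirit of Math-Shepherd-style labeling). A minimal sketch, with `complete` and `verify` as hypothetical stand-ins for the policy and a ground-truth checker:

```python
# Derive per-step labels for a process reward model (PRM) from an outcome
# verifier by Monte-Carlo rollouts from each partial solution.
from typing import Callable, List, Tuple

def label_steps(
    problem: str,
    steps: List[str],                        # a chain of thought, split into steps
    complete: Callable[[str], str],          # sample a completion from a partial solution
    verify: Callable[[str, str], bool],      # outcome verifier on full solutions
    rollouts: int = 8,
) -> List[Tuple[str, float]]:
    """Label each prefix of the solution by how often rollouts from it succeed."""
    labeled = []
    prefix = problem
    for step in steps:
        prefix = prefix + "\n" + step
        successes = sum(verify(problem, complete(prefix)) for _ in range(rollouts))
        labeled.append((prefix, successes / rollouts))   # soft process label in [0, 1]
    return labeled

# These (prefix, label) pairs can then train a PRM with a standard regression or
# cross-entropy objective, and the PRM in turn provides dense rewards for
# training the reasoning model.
```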
See also this related shortform in which I speculate about the relationship between o1 and AIXI:
Agency = Prediction + Decision.
AIXI is an idealized model of a superintelligent agent that combines “perfect” prediction (Solomonoff Induction) with “perfect” decision-making (sequential decision theory).
OpenAI’s o1 is a real-world “reasoning model” that combines a superhuman predictor (an LLM like GPT-4) with advanced decision-making (implicit search via chain of thought trained by RL).
To be clear: o1 is no AIXI. But AIXI, as an ideal, can teach us something about the future of o1-like systems.
AIXI teaches us that agency is simple. It involves just two raw ingredients: prediction and decision-making. And we know how to produce these ingredients. Good predictions come from self-supervised learning, an art we have begun to master over the last decade of scaling pretraining. Good decisions come from search, which has evolved from the explicit search algorithms that powered DeepBlue and AlphaGo to the implicit methods that drive AlphaZero and now o1.
So let’s call “reasoning models” like o1 what they really are: the first true AI agents. It’s not tool-use that makes an agent; it’s how that agent reasons. Bandwidth comes second.
Simple does not mean cheap: pretraining is an industrial process that costs (hundreds of) billions of dollars. Simple also does not mean easy: decision-making is especially difficult to get right since amortizing search (=training a model to perform implicit search) requires RL, which is notoriously tricky.
Simple does mean scalable. The original scaling laws taught us how to exchange compute for better predictions. The new test-time scaling laws teach us how to exchange compute for better decisions. AIXI may still be a ways off, but we can see at least one open path that leads closer to that ideal.
The bitter lesson is that “general methods that leverage computation [such as search and learning] are ultimately the most effective, and by a large margin.” The lesson from AIXI is that maybe these are all you need. The lesson from o1 is that maybe all that’s left is just a bit more compute...
We still don’t know the exact details of how o1 works. If you’re interested in reading about hypotheses for what might be going on and further discussion of the implications for scaling and recursive self-improvement, see my recent post, “o1: A Technical Primer”.
We’re not currently hiring, but you can always send us a CV to be kept in the loop and notified of next rounds.
Post-training consists of two RL stages followed by two SFT stages, one of which includes creative writing generated by DeepSeek-V3. This might account for the model both being good at creative writing and seeming closer to a raw base model.
Another possibility is that they apply the RL stages immediately after pretraining, without any intermediate SFT stage.