Executive director at Timaeus. Working on singular learning theory and developmental interpretability.
Website: jessehoogland.com
Twitter: @jesse_hoogland
Executive director at Timaeus. Working on singular learning theory and developmental interpretability.
Website: jessehoogland.com
Twitter: @jesse_hoogland
Claude 3.7 reward hacks. During training, Claude 3.7 Sonnet sometimes resorted to “special-casing” to pass tests when it got stuck — including directly hardcoding expected outputs or even modifying test files themselves. Rumors are circulating that o1/o3 was doing similar things — like overwriting equality operators to get Python tests to pass — and this may have contributed to the delayed release.
This seems relevant to claims that “we’ll soon have reward models sophisticated enough to understand human values” and that inner alignment is the real challenge. Instead, we’re seeing real examples of reward-hacking at the frontier.
RL is becoming important again. We should expect old failure modes to rear their ugly heads.
But “models have singularities and thus number of parameters is not a good complexity measure” is not a valid criticism of VC theory.
Right, this quote is really a criticism of the classical Bayesian Information Criterion (for which the “Widely applicable Bayesian Information Criterion” WBIC is the relevant SLT generalization).
Ah, I didn’t realize earlier that this was the goal. Are there any theorems that use SLT to quantify out-of-distribution generalization? The SLT papers I have read so far seem to still be talking about in-distribution generalization, with the added comment that Bayesian learning/SGD is more likely to give us “simpler” models and simpler models generalize better.
That’s right: existing work is about in-distribution generalization. It is the case that, within the Bayesian setting, SLT provides an essentially complete account of in-distribution generalization. As you’ve pointed out there are remaining differences between Bayes and SGD. We’re working on applications to OOD but have not put anything out publicly about this yet.
To be precise, it is a property of singular models (which includes neural networks) in the Bayesian setting. There are good empirical reasons to expect the same to be true for neural networks trained with SGD (across a wide range of different models, we observe the LLC progressively increase from ~0 over the course of training).
The key distinction is that VC theory takes a global, worst-case approach — it tries to bound generalization uniformly across an entire model class. This made sense historically but breaks down for modern neural networks, which are so expressive that the worst-case is always very bad and doesn’t get you anywhere.
The statistical learning theory community woke up to this fact (somewhat) with the Zhang et al. paper, which showed that deep neural networks can achieve perfect training loss on randomly labeled data (even with regularization). The same networks, when trained on natural data, will generalize well. VC dimension can’t explain this. If you can fit random noise, you get a huge (or even infinite) VC dimension and the resulting bounds fail to explain empircally observed generalization performance.
So I’d argue that dependence on the true-data distribution isn’t a weakness, but one of SLT’s great strengths. For highly expressive model classes, generalization only makes sense in reference to a data distribution. Global, uniform approaches like VC theory do not explain why neural networks generalize.
Thus if multiple parameter values lead to the same behaviour, this isn’t a problem for the theory at all because these redundancies do not increase the VC-dimension of the model class.
Multiple parameter values leading to the same behavior isn’t a problem — this is “the one weird trick.” The reason you don’t get the terribly generalizing solution that is overfit to noise is because simple solutions occupy more volume in the loss landscape, and are therefore easier to find. At the same time, simpler solutions generalize better (this is intuitively what Occam’s razor is getting at, though you can make it precise in the Bayesian setting). So it’s the solutions that generalize best that end up getting found.
If the claim is that it only needs to know certain properties of the true distribution that can be estimated from a small number of samples, then it will be nice to have a proof of such a claim (not sure if that exists).
I would say that this is a motivating conjecture and deep open problem (see, e.g., the natural abstractions agenda). I believe that something like this has to be true for learning to be at all possible. Real-world data distributions have structure; they do not resemble noise. This difference is what enables models to learn to generalize from finite samples.
Also note that if is allowed access to samples, then predicting whether your model generalizes is as simple as checking its performance on the test set.
For in-distribution generalization, yes, this is more or less true. But what we’d really like to get at is an understanding of how perturbations to the true distribution lead to changes in model behavior. That is, out-of-distribution generalization. Classical VC theory is completely hopeless when it comes to this. This only makes sense if you’re taking a more local approach.
See also my post on generalization here.
Okay, great, then we just have to wait a year for AlphaProofZero to get a perfect score on the IMO.
Yes, my original comment wasn’t clear about this, but your nitpick is actually a key part of what I’m trying to get at.
Usually, you start with imitation learning and tack on RL at the end. That’s what AlphaGo is. It’s what predecessors to Dreamer-V3 like VPT are. It’s what current reasoning models are.
But then, eventually, you figure out how to bypass the imitation learning/behavioral cloning part and do RL from the start. Human priors serve as a temporary bootstrapping mechanism until we develop approaches that can learn effectively from scratch.
I think this is important because the safety community still isn’t thinking very much about search & RL, even after all the recent progress with reasoning models. We’ve updated very far away from AlphaZero as a reference class, and I think we will regret this.
On the other hand, the ideas I’m talking about here seem to have widespread recognition among people working on capabilities. Demis is very transparent about where they’re headed with language models, AlphaZero, and open-ended exploration (e.g., at 20:48). Noam Brown is adamant about test-time scaling/reasoning being the future (e.g., at 20:32). I think R1 has driven the message home for everyone else.
With AlphaProof, the relevant piece is that the solver network generates its own proofs and disproofs to train against. There’s no imitation learning after formalization. There is a slight disanalogy where, for formalization, we mostly jumped straight to self-play/search, and I don’t think there was ever a major imitation-learning-based approach (though I did find at least one example).
Your quote “when reinforcement learning works well, imitation learning is no longer needed” is pretty close to what I mean. What I’m actually trying to get at is a stronger statement: we often bootstrap using imitation learning to figure out how to get the reinforcement learning component working initially, but once we do, we can usually discard the imitation learning entirely.
That’s fun but a little long. Why not… BetaZero?
What do you call this phenomenon?
First, you train AlphaGo on expert human examples. This is enough to beat Lee Sedol and Ke Jie. Then, you train AlphaZero purely through self-play. It destroys AlphaGo after only a few hours.
First, you train RL agents on human playthroughs of Minecraft. They do okay. Then, DreamerV3 learns entirely by itself and becomes the first to get diamonds.
First, you train theorem provers on human proofs. Then, you train AlphaProof using AlphaZero and you get silver on IMO for the first time.
First, you pretrain a language model on all human data. Then...
This feels like a special case of the bitter lesson, but it’s not the same thing. It seems to rely on the distinction between prediction and search latent in ideas like AISI. It’s the kind of thing that I’m sure Gwern has christened in some comment lost to the internet’s backwaters. We should have a name for it—something more refined than just “foom.”
We won’t strictly require it, but we will probably strongly encourage it. It’s not disqualifying, but it could make the difference between two similar candidates.
Post-training consists of two RL stages followed by two SFT stages, one of which includes creative writing generated by DeepSeek-V3. This might account for the model both being good at creative writing and seeming closer to a raw base model.
Another possibility is the fact that they apply the RL stages immediately after pretraining, without any intermediate SFT stage.
Implications of DeepSeek-R1: Yesterday, DeepSeek released a paper on their o1 alternative, R1. A few implications stood out to me:
Reasoning is easy. A few weeks ago, I described several hypotheses for how o1 works. R1 suggests the answer might be the simplest possible approach: guess & check. No need for fancy process reward models, no need for MCTS.
Small models, big think. A distilled 7B-parameter version of R1 beats GPT-4o and Claude-3.5 Sonnet new on several hard math benchmarks. There appears to be a large parameter overhang.
Proliferation by default. There’s an implicit assumption in many AI safety/governance proposals that AGI development will be naturally constrained to only a few actors because of compute requirements. Instead, we seem to be headed to a world where:
Advanced capabilities can be squeezed into small, efficient models that can run on commodity hardware.
Proliferation is not bottlenecked by infrastructure.
Regulatory control through hardware restriction becomes much less viable.
For now, training still needs industrial compute. But it’s looking increasingly like we won’t be able to contain what comes after.
This is a research direction that dates back to Clift et al. 2021. For a more recent and introductory example, see this post by @Daniel Murfet.
(Note: I’ve edited the announcement to remove explicit mention of geometry of program synthesis.)
I want to point out that there are many interesting symmetries that are non-global or data-dependent. These “non-generic” symmetries can change throughout training. Let me provide a few examples.
ReLU networks. Consider the computation involved in a single layer of a ReLU network:
or, equivalently,
(Maybe we’re looking at a two-layer network where are the inputs and are the outputs, or maybe we’re at some intermediate layer where these variables represent internal activations before and after a given layer.)
Dead neuron . If one of the biases is always larger than the associated preactivation , then the ReLU will always spit out a zero at that index. This “dead” neuron introduces a new continuous symmetry, where you can set the entries of column of to an arbitrary value, without affecting the network’s computation ().
Bypassed neuron . Consider the opposite: if for all possible inputs , then neuron will always activate, and the ReLU’s nonlinearity effectively vanishes at that index. This introduces a new continuous symmetry, where you can insert an arbitrary invertible transformation to the subspace of bypassed neurons between the activations and the final transformation. For the sake of clarity, assume all neurons are bypassed, then:
Hidden polytopes. A ReLU network learns a piecewise linear approximation to a function. For ease, consider the case of learning a 1-dimensional mapping. It might look something like this:
The vertices between polytopes correspond to a set of constraints on the weights. Consider what happens when two neighboring linear pieces line up (left to right). One vertex becomes redundant (dotted lined). You can now move the vertex along the shared polytope without changing the function implemented. This corresponds to a continuous transformation of your weights in some direction of weight space. Importantly this is only true locally— as soon as the vertex reaches the next edge of the shared polytope, pushing it any further will change the function. Moving the vertex in any direction orthogonal to the polytope will also change the function.
You might enjoy this new blogpost from HuggingFace, which goes into more detail.
I agree. My original wording was too restrictive, so let me try again:
I think pushing the frontier past 2024 levels is going to require more and more input from the previous generation’s LLMs. These could be open- or closed-source (the closed-source ones will probably continue to be better), but the bottleneck is likely to shift from “scraping and storing lots of data” to “running lots of inference to generate high-quality tokens.” This will change the balance to be easier for some players, harder for others. I don’t think that change in balance is perfectly aligned with frontier labs.
Phi-4: Synthetic data works. Pretraining’s days are numbered.
Microsoft just announced Phi-4, a 14B parameter model that matches GPT-4o on some difficult benchmarks. The accompanying technical report offers a glimpse into the growing importance of synthetic data and how frontier model training is changing.
Some takeaways:
The data wall is looking flimsier by the day. Phi-4 is highly capable not despite but because of synthetic data. It was trained on a curriculum of 50 types of synthetic datasets, generated by GPT-4o from a diverse set of organic data “seeds”. We’re seeing a smooth progression from training on (1) organic data, to (2) human-curated datasets, to (3) AI-curated datasets (filtering for appropriate difficulty, using verifiers), to (4) AI-augmented data (generating Q&A pairs, iteratively refining answers, reverse-engineering instructions from code, etc.), to (5) pure synthetic data.
Training is fracturing. It’s not just the quality and mixture but also the ordering of data that matters. Phi-4 features a “midtraining” phase that expands its context length from 4k to 16k tokens, upweighting long-context behavior only when the model has become capable enough to integrate that extra information. Post-training features a standard SFT phase and two rounds of DPO: one round of DPO using “pivotal token search” to generate minimally distinct pairs that are easier to learn from, and one round of more standard “judge-guided DPO”. In the author’s own words: “An end-to-end optimization of pretraining data mixture that also takes into account the effects of post-training is an interesting future area of investigation.”
The next frontier is self-improvement. Phi-4 was taught by GPT-4; GPT-5 is being taught by o1; GPT-6 will teach itself. This progression towards online learning is possible because of amortization: additional inference-time compute spent generating higher quality tokens becomes training data. The techniques range from simple (rejection-sampling multiple answers and iterative refinement) to complex (o1-style reasoning), but the principle remains: AI systems will increasingly be involved in training their successors and then themselves by curating, enhancing, and generating data, and soon by optimizing their own training curricula.
The implication: If you don’t have access to a 2024-frontier AI, you’re going to have a hard time training the next frontier model. That gap will likely widen with each subsequent iteration.
The RL setup itself is straightforward, right? An MDP where S is the space of strings, A is the set of strings < n tokens, P(s’|s,a)=append(s,a) and reward is given to states with a stop token based on some ground truth verifier like unit tests or formal verification.
I agree that this is the most straightforward interpretation, but OpenAI have made no commitment to sticking to honest and straightforward interpretations. So I don’t think the RL setup is actually that straightforward.
If you want more technical detail, I recommend watching the Rush & Ritter talk (see also slides and bibliography). This post was meant as a high-level overview of the different compatible interpretations with some pointers to further reading/watching.
Looking back at this, I think this post is outdated and was trying a little too hard to be provocative. I agree with everything you say here. Especially: “One could reasonably say that PAC learning is somewhat confused, but learning theorists are working on it!”
Forgive my youthful naïvité. For what it’s worth, I think the generalization post in this sequence has stood the test of time much better.