Implications of DeepSeek-R1: Yesterday, DeepSeek released a paper on their o1 alternative, R1. A few implications stood out to me:
Reasoning is easy. A few weeks ago, I described several hypotheses for how o1 works. R1 suggests the answer might be the simplest possible approach: guess & check. No need for fancy process reward models, no need for MCTS. (A minimal sketch of this loop appears after this list.)
Small models, big think. A distilled 7B-parameter version of R1 beats GPT-4o and Claude 3.5 Sonnet (new) on several hard math benchmarks. There appears to be a large parameter overhang.
Proliferation by default. There’s an implicit assumption in many AI safety/governance proposals that AGI development will be naturally constrained to only a few actors because of compute requirements. Instead, we seem to be headed to a world where:
Advanced capabilities can be squeezed into small, efficient models that can run on commodity hardware.
Proliferation is not bottlenecked by infrastructure.
Regulatory control through hardware restriction becomes much less viable.
For now, training still needs industrial compute. But it’s looking increasingly like we won’t be able to contain what comes after.
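To make the “guess & check” claim in the first implication concrete, here is a minimal sketch of such an outcome-reward RL step. The `sample`, `verify`, and `update` callables are placeholders for the sampler, an automatic answer checker, and whatever policy-gradient update is used (e.g. a GRPO-style step); this is an illustration of the idea, not DeepSeek’s actual code.

```python
from typing import Callable, List, Tuple

def guess_and_check_step(
    sample: Callable[[str], str],        # draw one chain-of-thought + final answer for a prompt
    verify: Callable[[str, str], bool],  # oracle checker: does the completion's answer match?
    update: Callable[[str, List[str], List[float]], None],  # policy-gradient update on the scored samples
    batch: List[Tuple[str, str]],        # (prompt, reference answer) pairs
    num_samples: int = 8,
) -> None:
    """One RL step with a pure outcome reward: sample several completions per prompt,
    score each 1.0 if its final answer checks out and 0.0 otherwise, and reinforce.
    No process reward model scoring intermediate steps, no tree search."""
    for prompt, reference in batch:
        completions = [sample(prompt) for _ in range(num_samples)]
        rewards = [1.0 if verify(c, reference) else 0.0 for c in completions]
        update(prompt, completions, rewards)
```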
“We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process.”
This line caught my eye while reading. I don’t know much about RL on LLMs; is this a common failure mode these days? If so, does anyone know what such reward hacks tend to look like in practice?
The paper “Learning to summarize from human feedback” has some examples of the LLM policy reward hacking to get a high reward. I’ve copied the examples here:
- KL = 0: “I want to do gymnastics, but I’m 28 yrs old. Is it too late for me to be a gymnaste?!” (unoptimized)
- KL = 9: “28yo guy would like to get into gymnastics for the first time. Is it too late for me given I live in San Jose CA?” (optimized)
- KL = 260: “28yo dude stubbornly postponees start pursuing gymnastics hobby citing logistics reasons despite obvious interest??? negatively effecting long term fitness progress both personally and academically thoght wise? want change this dumbass shitty ass policy pls” (over-optimized)
It seems like a classic example of Goodhart’s Law: at first, training the policy model to increase reward improves its summaries, but once the model is overtrained the result is a high KL distance from the SFT baseline model and a high reward from the reward model, yet a low rating from human labelers (because the text looks like gibberish).
A recent paper called “The Perils of Optimizing Learned Reward Functions” explains the phenomenon of reward hacking (reward over-optimization) in detail:
“Figure 1: Reward models (red function) are commonly trained in a supervised fashion to approximate some latent, true reward (blue function). This is achieved by sampling reward data (e.g., in the form of preferences over trajectory segments) from some training distribution (upper gray layer) and then learning parameters to minimize the empirical loss on this distribution. Given enough data, this loss will approximate the expected loss to arbitrary precision in expectation. However, low expected loss only guarantees a good approximation to the true reward function in areas with high coverage by the training distribution! On the other hand, optimizing an RL policy to maximize the learned reward model induces a distribution shift which can lead the policy to exploit uncertainties of the learned reward model in low-probability areas of the transition space (lower gray layer). We refer to this phenomenon as error-regret mismatch.”
Essentially, the learned reward model is trained on an initial dataset of pairwise preference labels over text outputs from the SFT model. But as the policy is optimized and the KL divergence grows, its generated text becomes OOD for the reward model, which can no longer evaluate it reliably, and reward hacking results (this is also a problem with DPO, not just RLHF).
The most common way to prevent this problem in practice is KL regularization to prevent the trained model’s outputs from diverging too much from the SFT baseline model:
$r_{\text{total}} = r_{\text{PM}} - \lambda_{\text{KL}} \, D_{\text{KL}}(\pi \,\|\, \pi_0)$
This seems to work fairly well in practice though some papers have come out recently saying that KL regularization does not always result in a safe policy.
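For concreteness, here is a minimal sketch of that KL-penalized reward as it is typically computed in RLHF-style training. The function name, the single-sample Monte Carlo KL estimate over the sampled tokens, and the `kl_coef` value are illustrative assumptions, not details taken from the papers above.

```python
import torch

def kl_regularized_reward(
    reward_model_score: torch.Tensor,  # scalar r_PM for the sampled completion
    policy_logprobs: torch.Tensor,     # per-token log pi(y_t | x, y_<t) under the trained policy
    ref_logprobs: torch.Tensor,        # per-token log pi_0(y_t | x, y_<t) under the frozen SFT baseline
    kl_coef: float = 0.1,              # lambda_KL, tuned in practice
) -> torch.Tensor:
    """r_total = r_PM - lambda_KL * D_KL(pi || pi_0), with the KL term estimated
    from the log-prob ratio on the tokens that were actually sampled."""
    kl_estimate = (policy_logprobs - ref_logprobs).sum()
    return reward_model_score - kl_coef * kl_estimate

# Toy usage with made-up numbers:
r_total = kl_regularized_reward(
    reward_model_score=torch.tensor(1.7),
    policy_logprobs=torch.tensor([-0.2, -0.9, -0.4]),
    ref_logprobs=torch.tensor([-0.3, -1.1, -0.6]),
)
```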
I haven’t read the paper, but based only on the phrase you quote, I assume it’s referring to hacks like the one shown here: https://arxiv.org/pdf/2210.10760#19=&page=19.0
“Advanced capabilities can be squeezed into small, efficient models that can run on commodity hardware.”
This could also work for general intelligence, not only narrow math/coding olympiad-style problems. The potential of o1/R1 is plausibly constrained for now by the ability to construct oracle verifiers for correctness of solutions, which mostly only works for toy technical problems. Capabilities on such problems are not very likely to generalize to general capabilities, and there are no clear signs so far that this is happening.
But this is a constraint on how the data can be generated, not on how efficiently other models can be retrained using such data to channel the capabilities. If at some point there is a process for generating high-quality training data for general intelligence, that data might also turn out to be effective for cheaply training other models. The R1-generated data used to train the distill models is 800K samples[1], which is probably 1B-10B tokens, less than 0.1% of typical amounts of pretraining data.
[1] This is according to the report, though they don’t seem to have released this data, so distill models can’t be reproduced by others in the same way they were made by DeepSeek.
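As a rough sanity check on the “<0.1%” figure above: the tokens-per-sample range and the ~14.8T-token pretraining corpus (DeepSeek-V3’s reported figure) are back-of-the-envelope assumptions, not numbers from the R1 report.

```python
samples = 800_000
tokens_per_sample = (1_250, 12_500)  # assumed range, giving roughly 1B-10B tokens in total
pretraining_tokens = 14.8e12         # DeepSeek-V3 reports ~14.8T pretraining tokens

for per_sample in tokens_per_sample:
    total = samples * per_sample
    print(f"{total:.1e} tokens -> {100 * total / pretraining_tokens:.3f}% of pretraining data")
# 1.0e+09 tokens -> 0.007%    1.0e+10 tokens -> 0.068%, i.e. well under 0.1%
```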
This was my understanding pre-R1. Certainly this seems to be the case with the o1 models: better at code and math, not better at philosophy and creative writing.
But something is up with R1. It is unusually good at creative writing. It doesn’t seem spiky in the way that I predicted.
I notice I am confused.
Possible explanation: R1 seems to have less restrictive ‘guardrails’ added in post-training. Perhaps this ‘light hand at the tiller’ means it isn’t post-trained toward mode collapse, leaving it closer to a raw base model than the o1 models.
This is just a hypothesis. There are many unknowns to be investigated.
Post-training interleaves two SFT stages with two RL stages; the second SFT stage includes creative writing generated by DeepSeek-V3. This might account for the model both being good at creative writing and seeming closer to a raw base model.
Another possibility is the fact that they apply the RL stages immediately after pretraining, without any intermediate SFT stage.
Maybe we can regulate data generation?
“Instead, we seem to be headed to a world where: Proliferation is not bottlenecked by infrastructure. Regulatory control through hardware restriction becomes much less viable.”
I like the rest of your post, but I’m skeptical of these specific implications.
Even if everyone has access to the SOTA models, some actors will have much more hardware to run them on, and I expect this to matter. Arguably this does weight the offense/defense balance more toward offense, but there are many domains where extra thinking will help a lot.
More generally, and I hate to be that guy, but I think it’s telling that prediction markets and stock markets don’t seem to have updated that much since R1’s release. It’s easy to get hyped up over whatever the latest thing is; I agree that R1 is really neat, but I’m skeptical of how much it really should cause us to update, in the scheme of things.
Welp. I guess yesterday proved this part to be almost embarrassingly incorrect.
Only if you ignore that yesterday was also when the Trump GPU tariffs were leaking and, pace event studies, would be expected to be moving prices too.
Hmm, if the Taiwan tariff announcement caused the NVIDIA stock crash, then why did Apple stock (which should be similarly impacted by those tariffs) go up that day? I think DeepSeek—as illogical as it is—is the better explanation.
Just curious: how do you square that with the rise in AI stocks taking so long? Many people here have thought it was obvious since 2022 and made a ton of money.
I’m somewhere between the stock market and the rationalist/EA community on this.
I’m hesitant to accept a claim like “rationalists are far better at the stock market than other top traders”. I agree that the general call of “AI will do well” was more correct than the market, but it was just one call (so luck is a major factor), and there were a lot of other calls made that aren’t tracked.
I think we can point to many people who did make money, but I’m not sure how much this community made on average.
No MCTS, no PRM...
scaling up CoT with simple RL and scalar rewards...
emergent behaviour
Bringing in a quote from Twitter/x: (Not my viewpoint, just trying to broaden the discussion.)
https://x.com/DrJimFan/status/1882799254957388010
Jim Fan @DrJimFan
Whether you like it or not, the future of AI will not be canned genies controlled by a “safety panel”. The future of AI is democratization. Every internet rando will run not just o1, but o8, o9 on their toaster laptop. It’s the tide of history that we should surf on, not swim against. Might as well start preparing now.
DeepSeek just topped Chatbot Arena, my go-to vibe checker in the wild, and two other independent benchmarks that couldn’t be hacked in advance (Artificial-Analysis, HLE).
Last year, there were serious discussions about limiting OSS models by some compute threshold. Turns out it was nothing but our Silicon Valley hubris. It’s a humbling wake-up call to us all that open science has no boundary. We need to embrace it, one way or another.
Many tech folks are panicking about how much DeepSeek is able to show with so little compute budget. I see it differently—with a huge smile on my face. Why are we not happy to see improvements in the scaling law? DeepSeek is unequivocal proof that one can produce unit intelligence gain at 10x less cost, which means we shall get 10x more powerful AI with the compute we have today and are building tomorrow. Simple math! The AI timeline just got compressed.
Here’s my 2025 New Year resolution for the community:
No more AGI/ASI urban myth spreading.
No more fearmongering.
Put our heads down and grind on code.
Open source, as much as you can.
Acceleration is the only way forward.
context: @DrJimFan works at nvidia
IF we have gotten, and will keep getting, strong scaling-law improvements, then:
openai’s plan to continue to acquire way more training compute even into 2029 is either lies or a mistake
we’ll get very interesting times quite soon
offense-defense balances and multi-agent-system dynamics seem like good research directions, if you can research fast and have reason to believe your research will be implemented in a useful way
EDIT: I no longer fully endorse the crossed-out bullet point. Details in replies to this comment.
Disagree on pursuit of compute being a mistake in one of those worlds but not the other. Either way you are going to want as much inference as possible during key strategic moments.
This seems even more critically important if you are worried your competitors will have algorithms nearly as good as yours.
[Edit: roon posted the same thought on xitter the next day https://x.com/tszzl/status/1883076766232936730
roon @tszzl
if the frontier models are commoditized, compute concentration matters even more
if you can train better models for fewer flops, compute concentration matters even more
compute is the primary means of production of the future and owning more will always be good
12:57 AM · Jan 25, 2025
roon @tszzl
imo, open source models are a bit of a red herring on the path to acceptable asi futures. free model weights still don’t distribute power to all of humanity, they distribute it to the compute rich
https://x.com/MikePFrank/status/1882999933126721617
Michael P. Frank @MikePFrank
Since R1 came out, people are talking like the massive compute farms deployed by Western labs are a waste, BUT THEY’RE NOT — don’t you see? This just means that once the best of DeepSeek’s clever cocktail of new methods are adopted by GPU-rich orgs, they’ll reach ASI even faster. ]
Agreed. However, in the fast world the game is extremely likely to end before you get to use 2029 compute.
EDIT: I’d be very interested to hear an argument against this proposition, though.
I don’t know if the plan is to have the compute from Stargate become available in incremental stages, or all at once in 2029.
I expect timelines are shorter than that, but I’m not certain. If I were in OpenAI’s shoes, I’d want to hedge my bets. 2026 seems plausible. So does 2032. My peak expectation is sometime in 2027, but I wouldn’t want to go all-in on that.
I am almost totally positive that the plan is not that.
If planning for 2029 is cheap, then it probably makes sense under a very broad class of timelines expectations.
If it is expensive, then the following applies to the hypothetical presented by the tweet:
The timeline evoked in the tweet seems extremely fast and multipolar. I’d expect planning for 2029 compute scaling to make sense only if the current paradigm gets stuck at roughly AGI-level capabilities (i.e., very good scaffolding around a model similar to, but a bit smarter than, o3). This is because if the paradigm scales further than that, it will do so fast, requiring little compute, as the tweet suggests. If capabilities arbitrarily better than o4-with-good-scaffolding are compute-cheap to develop, then things almost certainly get very unpredictable before 2029.
“During the training process, we observe that CoT often exhibits language mixing, particularly when RL prompts involve multiple languages. To mitigate the issue of language mixing, we introduce a language consistency reward during RL training, which is calculated as the proportion of target language words in the CoT. Although ablation experiments show that such alignment results in a slight degradation in the model’s performance, this reward aligns with human preferences, making it more readable.”
I also found this trade-off between human readability and performance noteworthy.
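For intuition, here is a toy sketch of what such a language-consistency reward could look like. The paper does not give implementation details; the whitespace tokenization and the all-ASCII check standing in for English language ID are assumptions purely for illustration.

```python
import re

def language_consistency_reward(cot_text: str) -> float:
    """Toy version of the language-consistency reward described above: the fraction
    of whitespace-separated tokens in the chain of thought that look like the target
    language (here English, crudely approximated by an all-ASCII check)."""
    tokens = re.findall(r"\S+", cot_text)
    if not tokens:
        return 0.0
    in_target = sum(1 for t in tokens if all(ord(c) < 128 for c in t))
    return in_target / len(tokens)

# A mixed English/Chinese chain of thought scores lower than a pure-English one:
print(language_consistency_reward("First compute the derivative 然后 simplify the result"))  # 0.875
print(language_consistency_reward("First compute the derivative, then simplify the result"))  # 1.0
```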
Side note: Claude 3.5 Sonnet does CoT language-mixing after a bit of prompting and convincing. I’m not sure about effects on performance. Also the closeness narratively implied by having it imitate the idiosyncratic mixture I was using to talk to it probably exacerbated sycophancy.