OpenAI o1, Llama 4, and AlphaZero of LLMs
GPT-4 level open weights models like Llama-3-405B don’t seem capable of dangerous cognition. OpenAI o1 demonstrates that a GPT-4 level model can be post-trained into producing useful long horizon reasoning traces. AlphaZero shows how capabilities can be obtained from compute alone, with no additional data. If there is a way of bringing these together, the apparent helplessness of the current generation of open weights models might prove misleading.
Post-training is currently a combination of techniques that use synthetic data and human labeled data. Human labeled data significantly improves quality, but its collection is slow and scales poorly. Synthetic data is an increasingly useful aspect of post-training, and the automated aspects of its generation scale easily. Unlike weaker models, GPT-4 level LLMs clearly pass reading comprehension on most occasions, and OpenAI o1 improves on this further. This suggests that at some point human data might become mostly unnecessary in post-training, even if it still helps slightly. Without it, post-training becomes automated and gets to use more compute, while avoiding the need for costly and complicated human labeling.
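As a minimal sketch of what such automated labeling could look like (this is my own illustration, not a description of anyone’s actual pipeline; `generate` and `judge_score` are hypothetical stand-ins for calls to a sufficiently capable chatbot):

```python
# Minimal sketch of automated preference labeling standing in for human
# labelers in post-training. `generate` and `judge_score` are hypothetical
# stand-ins for calls to a capable chatbot, not a real API.

def auto_label_preferences(prompts, generate, judge_score, samples_per_prompt=4):
    """Build a preference dataset (e.g. for DPO) with no human in the loop."""
    dataset = []
    for prompt in prompts:
        # Sample several candidate completions for each prompt.
        candidates = [generate(prompt) for _ in range(samples_per_prompt)]
        # Let a model act as the judge, replacing human labelers.
        ranked = sorted(candidates, key=lambda c: judge_score(prompt, c))
        # The best- and worst-rated completions form a (chosen, rejected) pair.
        dataset.append({"prompt": prompt,
                        "chosen": ranked[-1],
                        "rejected": ranked[0]})
    return dataset
```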
A pretrained model at the next level of scale, such as Llama 4, if made available in open weights, might initially look approximately as tame as current models. OpenAI o1 demonstrates that useful post-training for long sequences of System 2 reasoning is possible. In the case of o1 in particular, this might involve a lot of human labeling, making its reproduction a very complicated process (at least if the relevant datasets are not released, and the reasoning traces themselves are not leaked in large quantities). But if some generally available chatbots at the next level of scale are good enough at automating labeling, this complication could be sidestepped, with o1 style post-training cheaply reproduced on top of a previously released open weights model.
So there is an overhang in an open weights model that’s distributed without long horizon reasoning post-training, since applying such post-training significantly improves its capabilities, making prior assessments of those capabilities misleading. The problem right now is that a new level of pretraining scale is approaching in the coming months, while the ability to cheaply apply long horizon reasoning post-training might follow shortly thereafter, possibly unlocked by these very same models at the new level of pretraining scale (since it might currently be too expensive for most actors to implement, or to run enough experiments to figure out how). The resulting level of capabilities is currently unknown, and could well remain unknown outside the leading labs until after the enabling artifacts (the open weights pretrained models at the next level of scale) have already been published.
I recently translated 100 AIME-level math questions from another language into English as a test set for a Kaggle competition. The best model was GPT-4-32k, which could only solve 5-6 questions correctly. The rest of the models managed to solve just 1-3 questions.
Then, I tried the MATH dataset. While the difficulty level was similar, the results were surprisingly different: 60-80% of the problems were solved correctly.
I cannot see any o1 improvement on this.
Is this a well-known phenomenon, or am I onto something significant here?
Are you saying that o1 did not do any better than 5-6% on your AIME-equivalent dataset? That would be interesting given that o1 did far better on the 2024 AIME which presumably was released after the training cutoff: https://openai.com/index/learning-to-reason-with-llms/
How did you translate the dataset, and what is the translation quality?
Yes. And: This isn’t the only low-hanging fruit that will flywheel LLMs to better abilities. I mention some others briefly in “Real AGI” and in more depth in Capabilities and alignment of LLM cognitive architectures.
These are additional cognitive capacities that we know help power human cognition. There may well be other routes to advancing LLM agents’ intelligence.
Hoping this route doesn’t reach AGI soon is fine, but we should probably do some more thinking about what happens if LLM agents do reach proto-competent-autonomous AGI soon.
The point isn’t particularly that it’s low-hanging fruit or that this is going to happen with other LLMs soon. I expect that counterfactually System 2 reasoning likely happens soon even with no o1, made easy and thereby inevitable merely by further scaling of LLMs, so the somewhat surprising fact that it works already doesn’t significantly move my timelines.
The issue I’m pointing out is timing: a possible delay between the point when base models at the next level of scale, without o1-like System 2 reasoning, get published in open weights (Llama 4 seems the most likely specific model like that to come out next year), and the somewhat later point when it becomes feasible to apply post-training for o1-like System 2 reasoning to these base models.
In the interim, the decisions to publish open weights would be governed by the capabilities without System 2 reasoning, and so they won’t be informed decisions. It would be very easy to justify decisions to publish even in the face of third party evaluations, since those evaluations won’t themselves be applying o1-like post-training to a model that doesn’t already have it, in order to evaluate its resulting capabilities. But then a few months later, there is enough know-how in the open to do that, and capabilities cross all the thresholds that would’ve triggered in those evaluations, but didn’t, since o1-like post-training wasn’t yet commoditized at the time they were done.
Right. I should’ve emphasized the time-lag component. I guess I’ve been taking that for granted since I think primarily in terms of LLM cognitive architectures, not LLMs, as the danger.
The existence of other low-hanging fruit makes that situation worse. Even once o1-like post-training is commoditized and part of testing, there will be other cognitive capabilities with the potential to add dangerous capabilities.
In particular, the addition of useful episodic memory or other forms of continuous learning may have a nonlinear contribution to capabilities. Such learning already exists in at least two forms. Both of those and others are likely being improved as we speak.
Interesting, many of these things seem important as evaluation issues, even as I don’t think they are important algorithmic bottlenecks between now and superintelligence (because either they quickly fall by default, or else if only great effort gets them to work then they still won’t crucially help). So there are blind spots in evaluating open weights models that get more impactful with greater scale of pretraining, and less impactful with more algorithmic progress, which enables evaluations to see better.
Compute and funding could be important bottlenecks for multiple years, if $30 billion training runs don’t cut it. The fact that o1 is possible already (if not o1 itself) might actually be important for timelines, but indirectly, by enabling more funding for compute that cuts through the hypothetical thresholds of capability that would otherwise require significantly better hardware or algorithms.
Depending on how far inference scaling laws go, the situation might be worse still. Picture Llama-4-o1 scaffolds that anyone can run for indefinite amounts of time (as long as they have the money/compute) to autonomously do ML research on various ways to improve Llama-4-o1 and its open-weights descendants, which could in turn be applied again to autonomous ML research. Fortunately, lack of access to sufficient compute for pretraining the next-gen model is probably a barrier for most actors, but this still seems like a pretty scary situation to be in, and one that gets scarier with every open-weights improvement.
Now I’m picturing that, and I don’t like it.
Excellent point that these capabilities will contribute to more advancements that will compound rates of progress.
Worse yet, I have it on good authority that a technique much like the one o1 is thought to use can be done at very low cost and with little human effort on open-source models. It’s unclear how effective it is at those low levels of cost and effort, but it is definitely useful, so likely scalable to intermediate-sized projects.
Here’s hoping that the terrifying acceleration of proliferation and progress is balanced by the inherent ease of aligning LLM agents, and by the relatively slow takeoff speeds, giving us at least a couple of years to get our shit halfway together, including with the automated alignment techniques you focus on.
Related:
And more explicitly, from GDM’s Frontier Safety Framework, pages 5-6:
Image link is broken.
There is research suggesting that, when the amount of synthetic (AI-generated) data reaches some critical point, a complete degeneration of the model happens (“AI dementia”), and that “organic”, human-generated data is, in fact, crucial not only for training an initial model, but for maintaining model “intelligence” in later generations of models.
So it may be the other way around: human input is vastly more valuable, and the need for quality human input will only grow with time.
They even managed to publish it in Nature. But if you don’t throw out the original data and instead train on both the original data and the generated data, this doesn’t seem to happen (see also). Besides, there is the empirical observation that o1 in fact works at GPT-4 scale, so similar methodology might survive more scaling, at least at the upcoming ~5e26 FLOPs level of next year. That level is the focus of this post: the hypothetical where an open weights release arrives before there is an open source reproduction of o1’s methodology, which subsequently makes that model much stronger in a way that wasn’t accounted for when deciding to release it.
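For intuition on why keeping the original data matters, here is a toy sketch of the accumulate-vs-replace distinction (my own illustration; `train` and `sample` are placeholder callables, not any particular training stack):

```python
# Toy sketch of accumulate-vs-replace in iterated training on generated data:
# replacing the corpus with each generation's outputs degrades it, while
# adding generated data to the original corpus keeps the original signal
# around. `train` and `sample` are placeholder callables.

def iterate_generations(original_data, train, sample, generations=5,
                        accumulate=True):
    corpus = list(original_data)
    model = train(corpus)
    for _ in range(generations):
        synthetic = sample(model, n=len(original_data))
        if accumulate:
            corpus = corpus + synthetic   # keep everything seen so far
        else:
            corpus = synthetic            # discard earlier data (collapse regime)
        model = train(corpus)
    return model
```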
AlphaZero is purely synthetic data, and humans (note congenitally blind humans, so video data isn’t crucial) use maybe 10,000 times less natural data than Llama-3-405B (15 trillion tokens, so on the order of a billion tokens for a human) to get better performance, though we individually know far fewer facts. So clearly there is some way to get very far with merely 50 trillion natural tokens, though this is not relevant to o1 specifically.
Another point is that you can repeat the data for LLMs (5-15 times with good results, up to 60 times with slight further improvement; then there is double descent with worst performance at around 200 repetitions, so improvement might resume after hundreds of repetitions). This suggests that it might be possible to repeat natural data many times to balance out a lot more unique synthetic data.
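As a rough illustration of the arithmetic (both token counts below are my own assumptions, not figures from this thread):

```python
# Back-of-envelope: how far repetition stretches a fixed pool of natural data.
unique_natural_tokens = 50e12   # assumed pool of high-quality natural text
repetitions = 15                # repetition count still reported to work well

natural_token_presentations = unique_natural_tokens * repetitions
print(f"{natural_token_presentations:.1e}")   # 7.5e+14 token presentations
# A training mix could pair these with several times as many unique synthetic
# tokens while keeping natural data a substantial fraction of what is seen.
```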
Yeah, I’m extremely skeptical of the paper, and to point out one of its flaws: if you don’t throw away the original data, the collapse doesn’t happen.
There are almost certainly other assumptions that are likely wrong, but that alone makes me extremely skeptical of the model collapse concept, and while a lot of people want it to be true, there is no reason to believe it’s true.
Yup; there will also likely be a flywheel from automated ML research, e.g. https://sakana.ai/ai-scientist/ whose code is also open-source. Notably, o1 seemed to improve most at math and code, which seem to be some of the most relevant skills for automated ML research. And it’s not clear whether other parts of the workflow will be blockers either, with some recent results suggesting human-comparable automated ideation and reviewing.
You’ve linked to this Sakana AI paper like 8 times in the last week. IMO, please stop, it’s complete bunk, basically a scam.
I don’t think it being bunk is really any evidence against automated ML research becoming a thing soon, or even already being a thing, but the fact that you keep linking to it while ignoring the huge errors in it, and pretending it proves some point, is frustrating.
As someone who dislikes the hype over Sakana AI, and agrees that Bogdan should stop linking it so much, I think that it’s less of a scam, and more like an overhyped product that was not ready for primetime or the discussion it got.
The discourse around Sakana AI was not good, but I do think it has some uses, just not nearly as many as people want it to have.
Well known complete bunk can be useful for gesturing at an idea, even as it gives no evidence about related facts. It can be a good explanatory tool when there is little risk that people will take away related invalid inferences.
(Someone strong-downvoted Bogdan’s comment, which I opposed with a strong-upvote, since it doesn’t by itself seem to be committing the error of believing the Sakana hype, and it gates my reply, which I don’t want hidden just because the comment it happens to be a reply to drops into deeply negative karma.)
To maybe further clarify, I think of the Sakana paper roughly like how I think of autoGPT. LM agents were overhyped initially and autoGPT specifically didn’t work anywhere near as well as some people expected. But I expect LM agents as a whole will be a huge deal.
I genuinely don’t know what you’re referring to.
Fwiw, I’m linking to it because I think it’s the first/clearest demo of how the entire ML research workflow (e.g. see figure 1 in the arxiv) can plausibly be automated using LM agents, and they show a proof of concept which arguably already does something (in any case, it works better than I’d have expected it to). If you know of a better reference, I’d be happy to point to that instead/alternately. Similarly if you can ‘debunk it’ (I don’t think it’s been anywhere near debunked).
We had this conversation two weeks ago?
https://www.lesswrong.com/posts/rQDCQxuCRrrN4ujAe/jeremy-gillen-s-shortform?commentId=TXePXoEosJmAbMZSk
I thought you meant the AI scientist paper has some obvious (e.g. methodological or code) flaws or errors. I find that thread unconvincing, but we’ve been over this.
It’s not necessarily at all impactful. The crucial question for the next few years is whether and where LLM scaling plateaus. Before o1, GPT-4 level models couldn’t produce useful reasoning traces that are very long. Reading comprehension only just started mostly working at this scale. And RLHF through PPO is apparently to a large extent a game of carefully balancing early stopping. So it’s brittle and doesn’t generalize very far off-distribution, which made it unclear whether heavier RL can help with System 2 reasoning in situations where it’s not capable of rebuilding the capabilities from scratch, overwriting the damage it does to the LLM.
But taking base model scaling one step further should help both with the fragility of GPT-4 level models, and with their ability to carry out System 2 reasoning on their own, with a little bit of adaptation that elicits capabilities, rather than heavier RL that instills capabilities not already found to a useful extent in the base model. And then there’s at least one more step of base model scaling after that (deployed models run on about 40 megawatts of H100s, models in training on about 150 megawatts; in 1.5 years we’ll get to about a gigawatt, with 2x in FLOPs from moving to B200s and whatever effective FLOP/joule Trillium delivers). So there’s every chance this advancement is rendered obsolete, to the extent that it’s currently observable as a capability improvement, if these scaled up models just start being competent at System 2 reasoning without needing such post-training. Even the ability of cheaper models to reason could then be reconstructed by training them on reasoning traces collected from larger models.
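For a sense of what those power figures translate to in training compute, here is a rough back-of-envelope; every constant in it (per-GPU power with overhead, H100 throughput, utilization, run length) is my own assumption rather than a number from this thread.

```python
# Back-of-envelope from site power to training FLOPs. All constants are
# assumptions: ~1.4 kW per H100 including server/cooling/networking overhead,
# ~1e15 dense BF16 FLOP/s per H100, 40% utilization, a ~4 month run.
site_power_watts = 150e6            # "models in training about 150 megawatts"
watts_per_gpu = 1400
peak_flops_per_gpu = 1e15
utilization = 0.40
run_seconds = 4 * 30 * 24 * 3600    # ~4 months

gpus = site_power_watts / watts_per_gpu
training_flops = gpus * peak_flops_per_gpu * utilization * run_seconds
print(f"~{gpus:.0f} GPUs, ~{training_flops:.1e} FLOPs")
# ~1e5 GPUs and ~4e26 FLOPs, in the same ballpark as the ~5e26 FLOPs level
# mentioned upthread.
```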
On the other hand, this is potentially a new dimension of scaling, if the RL can do real work and brute force scaling of base models wouldn’t produce a chatbot good enough to reinvent the necessary RL on its own. There is a whole potentially very general pipeline for manufacturing generative capabilities here. It starts with a preference about the outcome (ability to verify an answer, to simultaneously judge the aesthetics and meaning of a poem, to see if a pull request really does resolve the issue). Then it proceeds to training a process supervision model that estimates how good individual reasoning steps are as contributions to getting a good outcome, and optimizing the generative model that proposes good reasoning steps. With a few more years of base model scaling, LLMs are probably going to be good enough at evaluating impactful outcomes that they are incapable of producing directly, so this pipeline gets a lot of capabilities to manufacture. If o1’s methodology is already applicable to this, this removes the uncertainty about whether it could’ve been made to start working without delay relative to the underlying scaling of base models. And the scaling of RL might go a long way before running out of exploitable preference signal about the outcomes of reasoning, independently of how the base models are being scaled up.
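A schematic sketch of that pipeline, with every function and argument name a placeholder of my own rather than anyone’s actual training code:

```python
# Schematic sketch of the outcome-preference -> process supervision -> policy
# improvement loop described above. `policy`, `outcome_verifier`, `train_prm`
# and `reinforce` are caller-supplied placeholders, not real library calls.

def improve_reasoning(policy, outcome_verifier, train_prm, reinforce, prompts,
                      rounds=3):
    for _ in range(rounds):
        # 1. Sample multi-step reasoning traces; score only their outcomes
        #    (verify the answer, judge the poem, check the pull request).
        traces = [(p, policy.sample(p)) for p in prompts]
        scored_traces = [(p, t, outcome_verifier(p, t)) for p, t in traces]
        # 2. Credit assignment: fit a process reward model estimating how much
        #    each intermediate step contributed to a good outcome.
        prm = train_prm(scored_traces)
        # 3. Optimize the step-proposing model against the process reward model
        #    (e.g. PPO or rejection-sampling fine-tuning).
        policy = reinforce(policy, prm, prompts)
    return policy
```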