The point isn’t particularly that it’s low-hanging fruit or that this is going to happen with other LLMs soon. I expect that counterfactually System 2 reasoning likely happens soon even with no o1, made easy and thereby inevitable merely by further scaling of LLMs, so the somewhat surprising fact that it works already doesn’t significantly move my timelines.
The issue I’m pointing out is timing: a possible delay between when base models at the next level of scale get published with open weights, not yet having o1-like System 2 reasoning (Llama 4 seems the most likely specific model like that to come out next year), and a bit later when it becomes feasible to apply post-training for o1-like System 2 reasoning to these base models.
In the interim, the decisions to publish open weights would be governed by the capabilities without System 2 reasoning, and so they won’t be informed decisions. It would be very easy to justify decisions to publish even in the face of third-party evaluations, since those evaluations won’t themselves be applying o1-like post-training to a model that doesn’t already have it in order to evaluate its resulting capabilities. But then a few months later, there is enough know-how in the open to do that, and capabilities cross all the thresholds that would’ve triggered in those evaluations, but didn’t, since o1-like post-training wasn’t yet commoditized at the time they were done.
Right. I should’ve emphasized the time-lag component. I guess I’ve been taking that for granted since I think primarily in terms of LLM cognitive architectures, not LLMs, as the danger.
The existence of other low-hanging fruit makes that situation worse. Even once o1-like post-training is commoditized and part of testing, there will be other cognitive capabilities with the potential to add dangerous capabilities.
In particular, the addition of useful episodic memory or other forms of continuous learning may have a nonlinear contribution to capabilities. Such learning already exists in at least two forms. Both of those and others are likely being improved as we speak.
Depending on how far inference scaling laws go, the situation might be worse still. Picture Llama-4-o1 scaffolds that anyone can run for indefinite amounts of time (as long as they have the money/compute) to autonomously do ML research on various ways to improve Llama-4-o1 and its open-weights descendants, which could then be applied again to autonomous ML research. Fortunately, lack of access to enough compute for pretraining the next-gen model is probably a barrier for most actors, but this still seems like a pretty scary situation to be in, and increasingly so with every open-weights improvement.
Now I’m picturing that, and I don’t like it.
Excellent point that these capabilities will contribute to more advancements that will compound rates of progress.
Worse yet, I have it on good authority that a technique much like the one o1 is thought to use can be done at very low cost and with little human effort on open-source models. It’s unclear how effective it is at those low levels of cost and effort, but it is definitely useful, so it likely scales to intermediate-sized projects.
Here’s hoping that the terrifying acceleration of proliferation and progress is balanced by the inherent ease of aligning LLM agents, and by the relatively slow takeoff speeds, giving us at least a couple of years to get our shit halfway together, including with the automated alignment techniques you focus on.
Related:
And more explicitly, from GDM’s Frontier Safety Framework, pages 5-6:
Image link is broken.
Interesting, many of these things seem important as evaluation issues, even though I don’t think they are important algorithmic bottlenecks between now and superintelligence (because either they quickly fall by default, or else, if only great effort gets them to work, they still won’t crucially help). So there are blind spots in evaluating open-weights models that get more impactful with greater scale of pretraining, and less impactful with more algorithmic progress, which enables evaluations to see better.
Compute and funding could be important bottlenecks for multiple years, if $30 billion training runs don’t cut it. The fact that o1 is possible already (if not o1 itself) might actually be important for timelines, but indirectly, by enabling more funding for compute that cuts through the hypothetical thresholds of capability that would otherwise require significantly better hardware or algorithms.