Do you know of any material that goes into more detail on the RL pre-training of o1?
As far as I know OpenAI has been pretty cagey about how o1 was trained, but there seems to be a general belief that they took the approach they had described in 2023 in ‘Improving mathematical reasoning with process supervision’ (although I wouldn’t think of that as pre-training).
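For anyone who hasn't read that work: the core idea of process supervision is to have a reward model score each intermediate reasoning step rather than only the final answer, and then use those per-step scores to rank sampled solutions. Here's a toy sketch of the difference, purely illustrative (the helper names and numbers are made up; this is not a claim about OpenAI's actual pipeline):

```python
# Toy contrast between outcome supervision (one label for the whole chain)
# and process supervision (a score per reasoning step). Illustrative only.

def outcome_score(final_answer_correct: bool) -> float:
    # Outcome supervision: a single reward for the whole finished solution.
    return 1.0 if final_answer_correct else 0.0

def process_score(step_scores: list[float]) -> float:
    # Process supervision: a reward model rates every intermediate step;
    # aggregating by product means one clearly wrong step sinks the chain.
    p = 1.0
    for s in step_scores:
        p *= s
    return p

# Rerank sampled chains of thought by their process score.
candidates = {
    "solution_a": [0.95, 0.92, 0.90],  # consistently plausible steps
    "solution_b": [0.99, 0.98, 0.15],  # one step the reward model flags
}
best = max(candidates, key=lambda name: process_score(candidates[name]))
print(best)  # -> solution_a
```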
It might well turn out to be a far harder problem than language-bound reasoning. You seem to have a different view and I’d be very interested in what underpins your conclusions.
I can at least gesture at some of what’s shaping my model here:
Roughly paraphrasing Ilya Sutskever (and Yudkowsky): in order to fully predict text, you have to understand the causal processes that created it; this includes human minds and the physical world that they live in.
The same strategy of self-supervised token-prediction seems to work quite well to extend language models to multimodal abilities up to and including generating video that shows an understanding of physics. I’m told that it’s doing pretty well for robots too, although I haven’t followed that literature.
We know that models which only see text nonetheless build internal world models like globes and game boards.
Proponents of the view that LLMs are just applying shallow statistical patterns to the regularities of language have made predictions based on that view that have failed repeatedly, such as the claim that no pure LLM would ever be able to correctly complete 'Three plus five equals'. Over and over we’ve seen predictions about what LLMs would never be able to do turn out to be false, usually not long thereafter (including the ones I mention in my post here). At a certain point that view just stops seeming very plausible.
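For what it’s worth, claims like that are trivially easy to check these days. A minimal sketch, assuming the Hugging Face transformers library is installed (the choice of GPT-2 is arbitrary, just an example):

```python
# Minimal check of whether a pure LLM can complete a simple arithmetic prompt.
# Assumes the Hugging Face transformers library; GPT-2 is just an example model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("Three plus five equals", max_new_tokens=5, do_sample=False)
print(out[0]["generated_text"])  # inspect whether the continuation is "eight"/"8"
```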
I think your intuition here is one that’s widely shared (and certainly seemed plausible to me for a while). But when we cash that out into concrete claims, those don’t seem to hold up very well. If you have some ideas about specific limitations that LLMs couldn’t overcome based on that intuition (ideally ones that we can get an answer to in the relatively near future), I’d be interested to hear them.
Hey, thanks for taking the time to answer!
First, I want to make clear that I don’t believe LLMs to be just stochastic parrots, nor do I doubt that they are capable of world modeling. And you are right to ask for more specifically stated beliefs and predictions; in this comment I’ve attempted to improve on that, with limited success.
There are two main pillars in my world model that make me, even in light of the massive gains in capabilities we have seen in the last seven years, still skeptical of the transformer architecture scaling straight to AGI:
1. Compute overhangs and algorithmic overhangs are regularly talked about. My belief is that a data overhang played a significant role in the success of the transformer architecture.
2. Humans are eager to find meaning and tend to project their own thoughts onto external sources. We even go so far as to attribute consciousness and intelligence to inanimate objects, as seen in animistic traditions. In the case of LLMs this behaviour could lead to an overly optimistic extrapolation of capabilities from toy problems.
On the first point: My model of the world circa 2017 looks like this. There’s a massive data overhang, which in a certain sense took humanity all of history to create: a special kind of data, refined over many human generations of “thinking work”, crystallized intelligence, but also with distinct blind spots. Some things are hard to capture with the available media; others we just didn’t much care to document.
Then the transformer architecture comes along, uniquely suited to extract the insights embedded in this data, maybe even better than the brains that created it in the first place. At the very least, it scales in a way that brains can’t. More compute makes more of this data overhang accessible, leading to massive capability gains from model to model.
But in 2024 the overhang has been all but consumed. Humans continue to produce more data, at an unprecedented rate, but still nowhere near enough to keep up with the demand.
On the second point: Taking the globe representation as an example, it is unclear to me how much of the resulting globe (or atlas) is actually the result of choices the authors made. The decision to map distance vectors in two or three dimensions seems to change the resulting representation. So, to what extent are these representations embedded in the model itself versus originating from the author’s mind? I’m reminded of similar problems in animal-intelligence research.
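To make that worry a bit more concrete: the same set of pairwise distances can look quite different depending on whether you ask for a two- or a three-dimensional embedding. A toy sketch using multidimensional scaling on a handful of rough city distances (purely illustrative; this is not the probing method those papers actually use):

```python
# Toy illustration: embedding the same distance matrix in 2D vs 3D gives
# different representations ("atlas" vs "globe"). Distances are rough km values.
import numpy as np
from sklearn.manifold import MDS

#               Paris   Tokyo  Sydney   Lima
D = np.array([[     0,   9710,  16960, 10240],
              [  9710,      0,   7820, 15500],
              [ 16960,   7820,      0, 12890],
              [ 10240,  15500,  12890,     0]], dtype=float)

for k in (2, 3):
    mds = MDS(n_components=k, dissimilarity="precomputed", random_state=0)
    coords = mds.fit_transform(D)
    # stress_ measures how badly the embedding distorts the given distances
    print(f"{k}D embedding, stress: {mds.stress_:.1f}")
```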
Again, it is clear there’s some kind of world model in the LLM, but less so how much this kind of research predicts about its potential (lack of) shortcomings.
However, this is still all rather vague; let me try to formulate some predictions which could plausibly be checked in the next year or so.
Predictions:
1. The world models of LLMs are impoverished in weird ways compared to humans, due to blind spots in the training data. An example would be tactile sensations, which seem to play an important role in the intuitive modeling of physics for humans. Solving some of these blind spots is critical for further capability gains.
2. To elicit further capability gains, it will become necessary to turn to data which is less well-suited for the transformer architecture. This will lead to escalating compute requirements, the effects of which will already become apparent in 2025.
3. As a result, there will be even stronger incentives for:
   a. Combining different ML architectures, including transformers, and classical software into compound systems. We currently call this scaffolding, but transformers will become less prominent in these systems. “LLMs plus some scaffolding” will not be an accurate description of the systems that solve the next batch of hard problems.
   b. Developing a completely new architecture, with a certain chance of another “Attention Is All You Need” moment: a new approach gaining the kind of eminence that transformers currently have. The likelihood and necessity of this is obviously a crux; currently I lean towards 3.a being sufficient for AGI even in the absence of another groundbreaking discovery.
4. Automated original ML research will turn out to be one of the hard problems that require 3.a or 3.b. The transformer architecture will not create its own scaffolding or successor.
Now, your comment prompted me to look more deeply into the current state of machine learning in robotics, and the success of decision transformers, and even more so behaviour transformers, disagrees with my predictions.
Examples:
https://arxiv.org/abs/2206.11251
https://sjlee.cc/vq-bet/
https://youtu.be/5_G6o_H3HeE?si=JOsTGvQ17ZfdIdAJ
Compound systems, yes. But clearly transformers have an outsized impact on the results, and they handled data which I would have filed under “not well-suited” just fine. For now, I’ll stick with my predictions, if only for the sake of accountability. But evidently it’s time for some more reading.
[EDIT: I originally gave an excessively long and detailed response to your predictions. That version is preserved (& commentable) here in case it’s of interest]
I applaud your willingness to give predictions! Some of them seem useful but others don’t differ from what the opposing view would predict. Specifically:
I think most people would agree that there are blind spots; LLMs have and will continue to have a different balance of strengths and weaknesses from humans. You seem to say that those blind spots will block capability gains in general; that seems unlikely to me (and it would shift me toward your view if it clearly happened) although I agree they could get in the way of certain specific capability gains.
The need for escalating compute seems like it’ll happen either way, so I don’t think this prediction provides evidence on your view vs the other.
Transformers not being the main cognitive component of scaffolded systems seems like a good prediction. I expect that to happen for some systems regardless, but I expect LLMs to be the cognitive core for most, until a substantially better architecture is found, and it will shift me a bit toward your view if that isn’t the case. I do think we’ll eventually see such an architectural breakthrough regardless of whether your view is correct, so I think that seeing a breakthrough won’t provide useful evidence.
‘LLM-centric systems can’t do novel ML research’ seems like a valuable prediction; if it proves true, that would shift me toward your view.
First of all, serious points for making predictions! And thanks for the thoughtful response.
Before I address specific points: I’ve been working on a research project that’s intended to help resolve the debate about LLMs and general reasoning. If you have a chance to take a look, I’d be very interested to hear whether you would find the results of the proposed experiment compelling; if not, why not, and what changes would make the evidence more compelling to you?
> Humans are eager to find meaning and tend to project their own thoughts onto external sources. We even go so far as to attribute consciousness and intelligence to inanimate objects, as seen in animistic traditions. In the case of LLMs this behaviour could lead to an overly optimistic extrapolation of capabilities from toy problems.
Absolutely! And then on top of that, it’s very easy to mistake using knowledge from the truly vast training data for actual reasoning.
> But in 2024 the overhang has been all but consumed. Humans continue to produce more data, at an unprecedented rate, but still nowhere near enough to keep up with the demand.
This does seem like one possible outcome. That said, it seems more likely to me that continued algorithmic improvements will result in better sample efficiency (certainly humans need far fewer language examples to learn language), and that multimodal data / synthetic data / self-play / simulated environments will continue to improve. I suspect capabilities researchers would have made more progress on all those fronts had it not been so easy, up to now, to just throw more data at the models.
In the past couple of weeks lots of people have been saying the scaling labs have hit the data wall, because of rumors of slowdowns in capabilities improvements. But before that, I was hearing at least some people in those labs saying that they expected to wring another 0.5–1 order of magnitude of human-generated training data out of what they had access to, and that still seems very plausible to me (although that would basically be the generation of GPT-5 and peer models; it seems likely to me that the generation past that will require progress on one or more of the fronts I named above).
> Taking the globe representation as an example, it is unclear to me how much of the resulting globe (or atlas) is actually the result of choices the authors made. The decision to map distance vectors in two or three dimensions seems to change the resulting representation. So, to what extent are these representations embedded in the model itself versus originating from the author’s mind?
I think that’s a reasonable concern in the general case. But in cases like the ones mentioned, the authors are retrieving information (eg lat/long) using only linear probes. I don’t know how familiar you are with the math there, but if something can be retrieved with a linear probe, it means that the model is already going to some lengths to represent that information and make it easily accessible.
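In case it’s useful, here’s roughly what that looks like in practice: a minimal linear-probe sketch, assuming the Hugging Face transformers and scikit-learn libraries (the actual papers use far more place names and a much more careful setup; the point is just that the readout from activations to coordinates is purely linear):

```python
# Minimal linear-probe sketch: predict (lat, long) from a model's hidden
# activations using only a linear map. Toy-sized; the actual papers use
# thousands of place names and held-out evaluation at scale.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import Ridge

cities = {"Paris": (48.9, 2.4), "Tokyo": (35.7, 139.7), "Cairo": (30.0, 31.2),
          "Sydney": (-33.9, 151.2), "Lima": (-12.0, -77.0), "Oslo": (59.9, 10.8)}

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2").eval()

feats, coords = [], []
for name, latlong in cities.items():
    ids = tok(name, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids, output_hidden_states=True).hidden_states[-1]
    feats.append(hidden[0, -1].numpy())  # last-token activation as feature vector
    coords.append(latlong)

probe = Ridge(alpha=1.0).fit(np.array(feats), np.array(coords))
print(probe.predict(np.array(feats[:2])))  # linear readout of (lat, long)
```

If a simple ridge regression on frozen activations recovers coordinates well above chance, the information is being represented in a (nearly) linearly accessible way; the model is doing the work, not the probe.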
> In the past couple of weeks lots of people have been saying the scaling labs have hit the data wall, because of rumors of slowdowns in capabilities improvements. But before that, I was hearing at least some people in those labs saying that they expected to wring another 0.5–1 order of magnitude of human-generated training data out of what they had access to, and that still seems very plausible to me
Epoch’s analysis from June supports this view, and suggests it may even be a bit too conservative:
(and that’s just for text—there are also other significant sources of data for multimodal models, eg video)