I think scaffolding is the wrong metaphor. Sequences of actions, observations and rewards are just more tokens to be modeled, and if I were running Google I would be busy instructing all work units to start packaging up such sequences of tokens to feed into the training runs for Gemini models. Many seemingly minor tasks (e.g. app recommendation in the Play store) either have, or could have, components of RL built into the pipeline, and could benefit from incorporating LLMs, either by putting the RL task in-context or through fine-tuning of very fast cheap models.
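To make the "just more tokens" point concrete, here is a minimal sketch, purely my own illustration and not anything from an actual Gemini or Play-store pipeline, of how an action/observation/reward trajectory could be flattened into text for a next-token predictor. All field names and the toy recommendation example are made up:

```python
# Illustrative only: an RL trajectory rendered as plain text that a next-token
# predictor can be trained on. The tag format and example data are invented.
def serialize_trajectory(steps):
    """steps: list of dicts with 'obs', 'action', 'reward' keys."""
    lines = []
    for t, step in enumerate(steps):
        lines.append(f"<obs {t}> {step['obs']}")
        lines.append(f"<act {t}> {step['action']}")
        lines.append(f"<rew {t}> {step['reward']}")
    return "\n".join(lines)

example = serialize_trajectory([
    {"obs": "user opened Play store, history=[chess app]",
     "action": "recommend: puzzle app", "reward": 0.0},
    {"obs": "user installed puzzle app",
     "action": "recommend: go app", "reward": 1.0},
])
print(example)  # this string is just more tokens for the model to predict
```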
So when I say I don’t see a distinction between LLMs and “short term planning agents” I mean that we already know how to subsume RL tasks into next token prediction, and so there is in some technical sense already no distinction. It’s a question of how the underlying capabilities are packaged and deployed, and I think that within 6-12 months there will be many internal deployments of LLMs doing short sequences of tasks within Google. If that works, then it seems very natural to just scale up sequence length as generalisation improves.
Arguably fine-tuning a next-token predictor on action, observation, reward sequences, or doing it in-context, is inferior to using algorithms like PPO. However, the advantage of knowledge transfer from the rest of the next-token predictor’s data distribution may more than compensate for this on some short-term tasks.
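As a rough sketch of the two objectives being weighed here (assumptions: `model` is a stand-in for any next-token predictor returning per-token logits, and the PPO piece is the standard clipped surrogate from the literature, not anyone's production training code):

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, trajectory_tokens):
    """Supervised fine-tuning on serialized (obs, action, reward) trajectories:
    plain next-token prediction, no explicit policy-gradient machinery."""
    logits = model(trajectory_tokens[:, :-1])  # predict each following token
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        trajectory_tokens[:, 1:].reshape(-1),
    )

def ppo_clip_loss(new_logprobs, old_logprobs, advantages, eps=0.2):
    """Clipped surrogate objective from PPO, for comparison: it needs rollouts,
    advantage estimates, and an old-policy snapshot rather than just tokens."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

The trade-off in the paragraph above is visible in the signatures: the supervised loss only needs logged trajectories, while the PPO loss needs on-policy rollouts and advantage estimates, but in exchange it optimizes reward directly.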
I think o1 is a partial realization of your thesis, and the only reason it's not more successful is that the compute used for o1 and GPT-4o was essentially the same:
https://www.lesswrong.com/posts/bhY5aE4MtwpGf3LCo/openai-o1
And yeah, the search part was actually quite good, if a bit modest in its gains.
As far as I can tell, Strawberry is proving me right: it goes beyond pre-training and scales inference—the obvious next step.
A lot of people said just scaling pre-trained transformers would get us to AGI. I think that's silly and doesn't make sense. But now you don't have to believe me—you can just use OpenAI's latest model.
The next step is to do efficient long-horizon RL for data-sparse domains.
That Strawberry works suggests this might not be so hard. Don't be fooled by the modest gains of Strawberry so far. This is a new paradigm that is heading us toward true AGI and superintelligence.
Yeah actually Alexander and I talked about that briefly this morning. I agree that the crux is “does this basic kind of thing work” and given that the answer appears to be “yes” we can confidently expect scale (in both pre-training and inference compute) to deliver significant gains.
I’d love to understand better how the RL training for CoT changes the representations learned during pre-training.
In my reading, Strawberry shows that indeed just scaling pre-trained transformers will *not* lead to AGI. The new paradigm is inference scaling—the obvious next step is doing RL on long horizons and sparse-data domains. I have been saying this ever since GPT-3 came out.
For the question of general intelligence, imho, scaling is conceptually a red herring: any (general-purpose) algorithm will do better when scaled. The key in my mind is the algorithm, not the resource, just like I would say a child is generally intelligent while a pocket calculator is not, even if the child can't count to 20 yet. It's about the meta-capability to learn, not the capability itself.
As we spoke earlier—it was predictable that this was going to be the next step. It was likely it was going to work, but there was a hopeful world in which doing the obvious thing turned out to be harder. That hope has been dashed—it suggests longer horizons might be easy too. This means superintelligence within two years is not out of the question.
We have been shown that this search algorithm works, and we have not yet been shown that the other approaches don't work.
Remember, technological development is disjunctive, and just because you've shown that one approach works doesn't mean you've shown that only that approach works.
Of course, people will absolutely try to scale this one up now that they've found success, and I think that timelines have definitely been shortened, but remember that AI progress is closer to a disjunctive scenario than a conjunctive one:
https://gwern.net/forking-path
I agree with the quote above, but I wanted to point out the disjunctiveness of AI progress.
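A toy calculation of why the disjunctive/conjunctive distinction matters, with numbers that are purely illustrative and not drawn from the discussion:

```python
# If there are several independent routes to a capability, a disjunctive scenario
# needs only one of them to work, while a conjunctive one needs all of them.
p_routes = [0.3, 0.2, 0.2, 0.1]  # hypothetical per-approach success probabilities

p_conjunctive = 1.0
for p in p_routes:
    p_conjunctive *= p            # every approach must work: ~0.001

p_disjunctive = 1.0
for p in p_routes:
    p_disjunctive *= (1.0 - p)
p_disjunctive = 1.0 - p_disjunctive   # at least one works: ~0.60

print(f"conjunctive: {p_conjunctive:.3f}, disjunctive: {p_disjunctive:.3f}")
```

Small per-route probabilities still compound into a large disjunctive total, which is why showing one approach works does little to rule the others in or out.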
Strong disagree. I would be highly surprised if there were multiple essentially different algorithms to achieve general intelligence*.
I also agree with Daniel Murfet's quote. There is a difference between a disjunction before you see the data and a disjunction after you see the data. I agree AI development is disjunctive before you see the data—but in hindsight all the things that work are really minor variants on a single thing that works.
*Of course, “essentially different” is doing a lot of work here. Some of the conceptual foundations of intelligence haven't been worked out enough (or Vanessa has worked them out and I don't understand it yet) for me to make a formal statement here.
Re different algorithms, I actually agree with both you and Daniel Murfet in that, conditional on non-reversible computers, there are at most 1-3 algorithms to achieve intelligence that can scale arbitrarily large, and I'm closer to 1 than 3 here.
But once reversible computers/superconducting wires are allowed, all bets are off on how many algorithms are possible, because you can have far, far more computation with far, far less waste heat leaving, and a lot of the design of computers is driven by heat constraints.
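For context on the heat point, one way to make the waste-heat claim concrete is Landauer's principle: irreversibly erasing a bit has a thermodynamic floor, which reversible computation can in principle avoid. The constants below are standard; the workload size is a made-up example:

```python
import math

k_B = 1.380649e-23   # Boltzmann constant, J/K
T = 300.0            # room temperature, K

# Minimum heat dissipated per irreversibly erased bit: k*T*ln(2)
landauer_joules_per_bit = k_B * T * math.log(2)
print(f"{landauer_joules_per_bit:.2e} J per bit erased")   # ~2.87e-21 J

bits_erased = 1e20   # hypothetical workload
print(f"minimum dissipation: {landauer_joules_per_bit * bits_erased:.3f} J")
```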
Reversible computing and superconducting wires seem like hardware innovations. You are saying that this will actually materially change the nature of the algorithm you’d want to run?
I'd bet against. I'd be surprised if this was the case. As far as I can tell, everything we have seen so far points to a common simple core of a general intelligence algorithm (basically an open-loop RL algorithm on top of a pre-trained transformer). I'd be surprised if there were materially different ways to do this. One of the main takeaways of the last decade of deep learning progress is just how little architecture matters—it's almost all data and compute (plus, I claim, one extra ingredient: open-loop RL that is efficient on long horizons and in sparse-data, novel domains).
I don't know for certain, of course. If I look at theoretical CS, though, the universality of computation makes me skeptical of radically different algorithms.