I think o1 is a partial realization of your thesis, and the only reason it's not more successful is that the compute used for o1 and GPT-4o was essentially the same:
https://www.lesswrong.com/posts/bhY5aE4MtwpGf3LCo/openai-o1
And yeah, the search part was actually quite good, if a bit modest in its gains.
As far as I can tell Strawberry is proving me right: it’s going beyond pre-training and scales inference—the obvious next step.
A lot of people said just scaling pre-trained transformers would get us to AGI. I think that's silly and doesn't make sense. But now you don't have to believe me—you can just use OpenAI's latest model.
The next step is to do efficient long-horizon RL for data-sparse domains.
Strawberry working suggests that this might not be so hard. Don't be fooled by Strawberry's modest gains so far. This is a new paradigm that is heading us toward true AGI and superintelligence.
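To make the inference-scaling point concrete, here is a toy sketch—not OpenAI's actual (unpublished) method, just an illustration of the principle: if a verifier can check candidate answers, spending more inference compute by sampling more candidates raises the success rate even with a fixed model. The 0.3 per-sample success probability is a made-up number.

```python
import random

random.seed(0)

def sample_answer(p_correct=0.3):
    """Toy stand-in for one model rollout: correct with probability p_correct."""
    return random.random() < p_correct

def verifier(is_correct):
    """Toy stand-in for a noiseless verifier/reward model."""
    return is_correct

def best_of_n(n, p_correct=0.3):
    """Sample n candidates; succeed if the verifier accepts any of them."""
    return any(verifier(sample_answer(p_correct)) for _ in range(n))

def success_rate(n, trials=10_000):
    """Empirical success rate of best-of-n over many trials."""
    return sum(best_of_n(n) for _ in range(trials)) / trials

# More inference compute (larger n) -> higher success rate, same model.
for n in (1, 4, 16):
    print(n, round(success_rate(n), 2))
```

Analytically the success rate is 1 − (1 − p)^n, so even a weak base model climbs toward reliability as n grows—which is the sense in which inference compute becomes a scaling axis of its own.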
Yeah actually Alexander and I talked about that briefly this morning. I agree that the crux is “does this basic kind of thing work” and given that the answer appears to be “yes” we can confidently expect scale (in both pre-training and inference compute) to deliver significant gains.
I’d love to understand better how the RL training for CoT changes the representations learned during pre-training.
In my reading, Strawberry shows that just scaling pre-trained transformers will *not* lead to AGI. The new paradigm is inference scaling; the obvious next step is doing RL on long horizons and sparse-data domains. I have been saying this ever since GPT-3 came out.
For the question of general intelligence, imho scaling is conceptually a red herring: any general-purpose algorithm will do better when scaled. The key, in my mind, is the algorithm, not the resource, just as I would say a child is generally intelligent while a pocket calculator is not, even if the child can't count to 20 yet. It's about the meta-capability to learn, not the capability itself.
As we spoke earlier—it was predictable that this was going to be the next step. It was likely it was going to work, but there was a hopeful world in which doing the obvious thing turned out to be harder. That hope has been dashed—it suggests longer horizons might be easy too. This means superintelligence within two years is not out of the question.
We have been shown that this search algorithm works; we have not yet been shown that the other approaches don't work.
Remember, technological development is disjunctive: just because you've shown that one approach works doesn't mean you've shown that only that approach works.
Of course, people will absolutely try to scale this one up now that they've found success, and I think timelines have definitely been shortened, but remember that AI progress is closer to a disjunctive scenario than a conjunctive one:
I agree with this quote below, but I wanted to point out the disjunctiveness of AI progress:
https://gwern.net/forking-path
Strong disagree. I would be highly surprised if there were multiple essentially different algorithms to achieve general intelligence*.
I also agree with Daniel Murfet's quote. There is a difference between a disjunction before you see the data and a disjunction after you see the data. I agree AI development is disjunctive before you see the data—but in hindsight all the things that work are really minor variants on a single thing that works.
*Of course, "essentially different" is doing a lot of work here. Some of the conceptual foundations of intelligence haven't been worked out enough (or Vanessa has worked them out and I don't understand it yet) for me to make a formal statement here.
Re different algorithms, I actually agree with both you and Daniel Murfet in that, conditional on non-reversible computers, there are at most 1-3 algorithms to achieve intelligence that can scale arbitrarily large, and I'm closer to 1 than 3 here.
But once reversible computers/superconducting wires are allowed, all bets are off on how many algorithms are possible, because you can have far, far more computation with far, far less waste heat, and a lot of computer design is driven by heat constraints.
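For reference, the thermodynamic argument here is Landauer's principle: each irreversible bit erasure dissipates at least k_B·T·ln 2 of heat, a bound that reversible computing can in principle evade. A back-of-envelope sketch (the 1e18 erasures/s figure is a made-up illustrative rate, not a measurement of any real chip):

```python
import math

# Landauer's principle: erasing one bit costs at least k_B * T * ln(2) of heat.
k_B = 1.380649e-23  # Boltzmann constant, J/K (exact, SI definition)
T = 300.0           # room temperature, K

e_bit = k_B * T * math.log(2)  # minimum heat per irreversible bit erasure, J

# Hypothetical chip performing 1e18 irreversible bit erasures per second:
power_floor = e_bit * 1e18     # watts dissipated at the Landauer floor

print(e_bit, power_floor)      # ~2.87e-21 J/bit, ~2.87e-3 W
```

The floor is milliwatts, while real chips dissipate hundreds of watts—many orders of magnitude above the thermodynamic minimum. Reversible logic avoids the per-bit erasure cost entirely, which is why relaxing the heat constraint could plausibly change which algorithm designs are worth running.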
Reversible computing and superconducting wires seem like hardware innovations. You are saying that this will actually materially change the nature of the algorithm you’d want to run?
I'd bet against. I'd be surprised if this were the case. As far as I can tell, everything we have seen so far points to a common simple core of a general intelligence algorithm (basically an open-loop RL algorithm on top of a pre-trained transformer). I'd be surprised if there were materially different ways to do this. One of the main takeaways of the last decade of deep learning progress is just how little architecture matters: it's almost all data and compute (plus, I claim, one extra ingredient: open-loop RL that is efficient on long horizons and in sparse-data, novel domains).
I don't know for certain, of course. If I look at theoretical CS, though, the universality of computation makes me skeptical of radically different algorithms.