Sure, this is useful. To your other posts, I don’t think we’re really disagreeing about what AGI is—I think we’d agree that if you took a model with GPT4-like capabilities and hooked it up to a chess API to reinforce it you would end up with a GPT4 model that’s very good at playing chess, not something that has strongly-improved its general underlying world model and thus would also be able to say improve its LSAT score. And this is what I’m imaging most self-play training would accomplish… but I’m open to being wrong. To your point about having a “benchmark of many tasks”, I guess maybe I could imagine hooking it up to like 100 different self-playing games which are individually easy to run but require vastly different skills to master, but I could also see this just… not working as well. Teams have been trying this for a decade or so already, right? A breakthrough is possible though for sure.
I’m just trying to underscore that there are lots of tasks which we hope that AGIs would be able to accomplish (eg. solving open math problems) but we probably cannot use RL to directly iterate a model to accomplish this task because we can’t define a gradient of reward that would help define the AGI.
To your point about having a “benchmark of many tasks”, I guess maybe I could imagine hooking it up to like 100 different self-playing games which are individually easy to run but require vastly different skills to master, but I could also see this just… not working as well. Teams have been trying this for a decade or so already, right? A breakthrough is possible though for sure.
No, nobody has been trying anything for decades that matters. As it turns out, the only thing that matters was scale. So there are 3 companies that had enough money for scale, and they are the only efforts that count, and all combined have done a small enough number of full scale experiments you can count them up with 2 hands. @gwern has expressed the opinion that we probably didn’t even need the transformer, other neural networks likely would have worked at these scales.
As for the rest of it, no, we’re saying at massive scales, we abdicate trying to understand AGI architectures—since they are enormously complex and coupled machined—and just iteratively find some that work by trial and error.
“work” includes generality. The architecture that can play 100 games and does extremely well at game 101 the first try gets way more points than one that doesn’t. The one that has never read a book on the topic of the LSAT but still does well on the exam is exactly what we are looking for. (though this can be tough to filter since obviously it’s simply easier to train on all text in existence).
One that has controlled a robot to manipulate fine wire and many object manip tasks, and one that has passed the exams for a course on electronics, and then first try builds a working circuit in a simulated world is what we’re looking for. So more points on that.
That’s the idea. Define what we want the machine to do and what we mean by “generality”, iterate over the search space a very large number of times. In an unbiased way, pick the most distinct n winners and have those winners propose the next round of AGI designs and so on.
And most of the points for the winners are explicitly for the generality behavior we are seeking.
>As it turns out, the only thing that matters was scale.
I mean, in some sense yes. But AlphaGo wasn’t trained by finding a transcript of every Go game that had ever been played, but instead was trained via self-play RL. But attempts to create general game-playing agents via similar methods haven’t worked out very well, in my understanding. I don’t assume that if we just threw 10x or 100x data at them that this would change...
>The architecture that can play 100 games and does extremely well at game 101 the first try gets way more points than one that doesn’t. The one that has never read a book on the topic of the LSAT but still does well on the exam is exactly what we are looking for.
Yes, but the latter exists and is trained via human reinforcement learning that can’t be translated to self-play. The former doesn’t exist as far as I can tell. I don’t see anyone proposing to improve GPT-4 by turning from HFRL to self-play RL.
Ultimately I think there’s a possibility that the improvements to LLMs from further scaling may not be very large, and instead we’ll need to find some sort of new architecture to create dangerous AGIs.
Gpt-4 did RL feedback that was self evaluation across all the inputs users fed by chatGPT.
Self play would be having it practice leetcode problems with the RL feedback the score.
The software support is there and the RL feedback worked, why do you think it is even evidence to say “obvious thing that works well hasn’t been done yet or maybe it has, openAI won’t say”
There is also a tremendous amount of self play possible now with the new plugin interface.
Sure, this is useful. To your other posts, I don’t think we’re really disagreeing about what AGI is—I think we’d agree that if you took a model with GPT4-like capabilities and hooked it up to a chess API to reinforce it you would end up with a GPT4 model that’s very good at playing chess, not something that has strongly-improved its general underlying world model and thus would also be able to say improve its LSAT score. And this is what I’m imaging most self-play training would accomplish… but I’m open to being wrong. To your point about having a “benchmark of many tasks”, I guess maybe I could imagine hooking it up to like 100 different self-playing games which are individually easy to run but require vastly different skills to master, but I could also see this just… not working as well. Teams have been trying this for a decade or so already, right? A breakthrough is possible though for sure.
I’m just trying to underscore that there are lots of tasks which we hope that AGIs would be able to accomplish (eg. solving open math problems) but we probably cannot use RL to directly iterate a model to accomplish this task because we can’t define a gradient of reward that would help define the AGI.
To your point about having a “benchmark of many tasks”, I guess maybe I could imagine hooking it up to like 100 different self-playing games which are individually easy to run but require vastly different skills to master, but I could also see this just… not working as well. Teams have been trying this for a decade or so already, right? A breakthrough is possible though for sure.
No, nobody has been trying anything for decades that matters. As it turns out, the only thing that matters was scale. So there are 3 companies that had enough money for scale, and they are the only efforts that count, and all combined have done a small enough number of full scale experiments you can count them up with 2 hands. @gwern has expressed the opinion that we probably didn’t even need the transformer, other neural networks likely would have worked at these scales.
As for the rest of it, no, we’re saying at massive scales, we abdicate trying to understand AGI architectures—since they are enormously complex and coupled machined—and just iteratively find some that work by trial and error.
“work” includes generality. The architecture that can play 100 games and does extremely well at game 101 the first try gets way more points than one that doesn’t. The one that has never read a book on the topic of the LSAT but still does well on the exam is exactly what we are looking for. (though this can be tough to filter since obviously it’s simply easier to train on all text in existence).
One that has controlled a robot to manipulate fine wire and many object manip tasks, and one that has passed the exams for a course on electronics, and then first try builds a working circuit in a simulated world is what we’re looking for. So more points on that.
That’s the idea. Define what we want the machine to do and what we mean by “generality”, iterate over the search space a very large number of times. In an unbiased way, pick the most distinct n winners and have those winners propose the next round of AGI designs and so on.
And most of the points for the winners are explicitly for the generality behavior we are seeking.
>As it turns out, the only thing that matters was scale.
I mean, in some sense yes. But AlphaGo wasn’t trained by finding a transcript of every Go game that had ever been played, but instead was trained via self-play RL. But attempts to create general game-playing agents via similar methods haven’t worked out very well, in my understanding. I don’t assume that if we just threw 10x or 100x data at them that this would change...
>The architecture that can play 100 games and does extremely well at game 101 the first try gets way more points than one that doesn’t. The one that has never read a book on the topic of the LSAT but still does well on the exam is exactly what we are looking for.
Yes, but the latter exists and is trained via human reinforcement learning that can’t be translated to self-play. The former doesn’t exist as far as I can tell. I don’t see anyone proposing to improve GPT-4 by turning from HFRL to self-play RL.
Ultimately I think there’s a possibility that the improvements to LLMs from further scaling may not be very large, and instead we’ll need to find some sort of new architecture to create dangerous AGIs.
Gpt-4 did RL feedback that was self evaluation across all the inputs users fed by chatGPT.
Self play would be having it practice leetcode problems with the RL feedback the score.
The software support is there and the RL feedback worked, why do you think it is even evidence to say “obvious thing that works well hasn’t been done yet or maybe it has, openAI won’t say”
There is also a tremendous amount of self play possible now with the new plugin interface.