I’m asking specifically about the assertion that “RL-style self-play” could be used to iterate to AGI. I don’t see what sort of game could lead to this outcome. You can’t have this sort of self-play with “solve this math problem” as far as I can tell, and even if you could, I don’t see why it would promote AGI as opposed to something that can solve a narrow class of math problems.
Obviously LLMs have amazing generalist capabilities. But as far as I can tell you can’t iterate on the next version of these models by hooking them up to some sort of API that provides useful, immediate feedback… we’re not at the cusp of removing the HF part of the RLHF loop. I think understanding this is key to whether we should expect a slow takeoff or a fast one.
Anyways, here’s how to get an AGI this way: https://www.lesswrong.com/posts/Aq82XqYhgqdPdPrBA/full-transcript-eliezer-yudkowsky-on-the-bankless-podcast?commentId=Mvyq996KxiE4LR6ii
This will work; the only reason it won’t get used is that it is possibly not the computationally cheapest option. (This proposal is incredibly expensive in compute unless we reuse a lot of components between iterations.)
Whether you consider a machine an “AGI” when it has a score heuristic that forces generality, by negatively weighting complex specialized architectures and heavily weighting zero-shot multimodal/multi-skill tasks, and is able to do hundreds of thousands of tasks, is up to your definition.
Since the machine would be self-replicating and capable of all industrial, construction, driving, logistics, and software-writing tasks (all things that conveniently fall into the scope of “can be objectively evaluated”), I say it’s an AGI. It’s capable of everything needed to copy itself forever and to self-improve; it’s functionally a sentient new civilization. The things you mentioned, like beating GRRM at writing a good story, do not matter.
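To make that score heuristic concrete, here is a minimal sketch of what it might look like. Everything in it is illustrative: the field names, the weights, and the `CandidateResult` record are my own assumptions, not part of any real system.

```python
from dataclasses import dataclass

@dataclass
class CandidateResult:
    """Hypothetical benchmark record for one candidate architecture."""
    zero_shot_scores: dict    # task name -> score in [0, 1] on tasks never trained on
    trained_scores: dict      # task name -> score on tasks it was trained on
    specialized_modules: int  # count of task-specific components in the architecture

def generality_score(r: CandidateResult,
                     zero_shot_weight: float = 10.0,
                     specialization_penalty: float = 2.0) -> float:
    """Reward zero-shot breadth heavily and penalize complex specialized
    architectures, per the heuristic described above. Weights are made up."""
    zero_shot = zero_shot_weight * sum(r.zero_shot_scores.values())
    trained = sum(r.trained_scores.values())
    return zero_shot + trained - specialization_penalty * r.specialized_modules
```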
Sure, this is useful. To your other posts, I don’t think we’re really disagreeing about what AGI is. I think we’d agree that if you took a model with GPT-4-like capabilities and hooked it up to a chess API to reinforce it, you would end up with a GPT-4 model that’s very good at playing chess, not something that has strongly improved its general underlying world model and thus would also be able to, say, improve its LSAT score. And this is what I’m imagining most self-play training would accomplish… but I’m open to being wrong.
To your point about having a “benchmark of many tasks”, I guess maybe I could imagine hooking it up to like 100 different self-playing games which are individually easy to run but require vastly different skills to master, but I could also see this just… not working as well. Teams have been trying this for a decade or so already, right? A breakthrough is possible though for sure.
I’m just trying to underscore that there are lots of tasks we hope AGIs would be able to accomplish (e.g., solving open math problems) where we probably cannot use RL to directly iterate a model toward the goal, because we can’t define a gradient of reward that would push the model toward the AGI we want.
>To your point about having a “benchmark of many tasks”, I guess maybe I could imagine hooking it up to like 100 different self-playing games which are individually easy to run but require vastly different skills to master, but I could also see this just… not working as well. Teams have been trying this for a decade or so already, right? A breakthrough is possible though for sure.
No, nobody has been trying anything that matters for decades. As it turns out, the only thing that matters was scale. So there are 3 companies that had enough money for scale, and they are the only efforts that count; combined, they have done a small enough number of full-scale experiments that you can count them on two hands. @gwern has expressed the opinion that we probably didn’t even need the transformer; other neural networks likely would have worked at these scales.
As for the rest of it, no: we’re saying that at massive scales we abdicate trying to understand AGI architectures, since they are enormously complex and coupled machines, and just iteratively find some that work by trial and error.
“Work” includes generality. The architecture that can play 100 games and does extremely well at game 101 the first try gets way more points than one that doesn’t. The one that has never read a book on the topic of the LSAT but still does well on the exam is exactly what we are looking for. (Though this can be tough to filter, since it’s obviously easier to just train on all the text in existence.)
One that has controlled a robot through fine-wire manipulation and many other object-manipulation tasks, has passed the exams for a course on electronics, and then builds a working circuit in a simulated world on the first try is what we’re looking for. So: more points for that.
That’s the idea. Define what we want the machine to do and what we mean by “generality”, then iterate over the search space a very large number of times. In an unbiased way, pick the n most distinct winners and have those winners propose the next round of AGI designs, and so on.
And most of the points for the winners are explicitly for the generality behavior we are seeking.
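As a sketch, the outer loop might look something like the following. All of it is hypothetical: `evaluate`, `propose_designs`, and `distinct_enough` are stubs standing in for a real training-and-benchmarking pipeline, and a real `evaluate` would compute something like the generality-weighted score sketched earlier.

```python
import random

def evaluate(design: str) -> float:
    """Stub: pretend to train `design` and benchmark it across many tasks.
    A real version would return a generality-weighted score."""
    return random.random()

def propose_designs(design: str, n_children: int = 5) -> list:
    """Stub: a winning design proposes variants of itself for the next round."""
    return [f"{design}+v{i}" for i in range(n_children)]

def distinct_enough(design: str, chosen: list) -> bool:
    """Stub distinctness check; a real one would compare architectures."""
    return design not in chosen

def search(initial_designs: list, n_rounds: int = 10, n_winners: int = 3) -> list:
    """The loop described above: score every candidate for generality,
    keep the n most distinct winners, let the winners propose the next
    round of designs, and repeat."""
    population = list(initial_designs)
    winners = []
    for _ in range(n_rounds):
        ranked = sorted(population, key=evaluate, reverse=True)
        winners = []
        for design in ranked:
            if distinct_enough(design, winners):
                winners.append(design)
            if len(winners) == n_winners:
                break
        population = [child for w in winners for child in propose_designs(w)]
    return winners

print(search(["dense-transformer", "mixture-of-experts", "rnn-hybrid"]))
```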
>As it turns out, the only thing that matters was scale.
I mean, in some sense yes. But AlphaGo wasn’t trained on a transcript of every Go game that had ever been played; it was trained via self-play RL. And attempts to create general game-playing agents via similar methods haven’t worked out very well, in my understanding. I don’t assume that throwing 10x or 100x the data at them would change that...
>The architecture that can play 100 games and does extremely well at game 101 the first try gets way more points than one that doesn’t. The one that has never read a book on the topic of the LSAT but still does well on the exam is exactly what we are looking for.
Yes, but the latter exists and is trained via reinforcement learning from human feedback, which can’t be translated to self-play. The former doesn’t exist as far as I can tell. I don’t see anyone proposing to improve GPT-4 by switching from RLHF to self-play RL.
Ultimately I think there’s a possibility that the improvements to LLMs from further scaling may not be very large, and instead we’ll need to find some sort of new architecture to create dangerous AGIs.
GPT-4 did get RL feedback that was self-evaluation across all the inputs users fed to ChatGPT.
Self-play would be having it practice LeetCode problems, with the score as the RL feedback.
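A minimal sketch of that feedback loop, under my own assumptions: the convention that each generated solution defines a `solve` function, and a stand-in `model.generate` call in place of any real API. The reward is just the fraction of test cases passed.

```python
def score_solution(solution_code: str, tests: list) -> float:
    """Run a generated solution against (input, expected_output) test cases
    and return the fraction passed; this fraction is the RL reward."""
    passed = 0
    for test_input, expected in tests:
        try:
            namespace = {}
            exec(solution_code, namespace)  # assumes the solution defines solve()
            if str(namespace["solve"](test_input)) == str(expected):
                passed += 1
        except Exception:
            pass  # a crash scores zero on that test case
    return passed / len(tests) if tests else 0.0

# One self-play step might look like (model.generate is a stand-in):
#   solution = model.generate(problem.description)
#   reward = score_solution(solution, problem.tests)
#   ...then feed `reward` back into the RL update for that generation.
```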
The software support is there and the RL feedback worked, so why do you think the evidence says “the obvious thing that works well hasn’t been done yet”? (Or maybe it has; OpenAI won’t say.)
There is also a tremendous amount of self-play possible now with the new plugin interface.
You can connect them to such an API; it’s not hard, we already have the pieces to build the API, and you can start with LLMs. It’s a fairly simple recursive benchmark, and an obvious one.
The main limit is just money.
I think you need to define what you think AGI is first.
I think with a reasonable, grounded, and measurable definition of AGI, it is trivial to do with self-play. Please tell me what you think AGI means. I don’t think it matters if there are subjective things the AGI can’t do well.