So I’m going to strong disagree here.
First of all, as it turns out in practice, scale was everything. That means for any AI idea you want to name, unless it was based on a transformer and worked on by one of roughly three labs, it was never actually attempted.
We can just ignore the thousands of other AI methods humans have tried, because none of them were attempted at a relevant level of scale.
Therefore, RSI (recursive self-improvement) has never been tried.
Second, you can easily design a variation on RSI that works fine with current paradigms.
It’s not precisely RSI, but it’s functionally the same thing. Here are the steps (a rough code sketch of all three follows the list):
[1] A benchmark of many tasks. Tasks must be autogradeable; human participants must be able to ‘play’ the tasks so we have a control-group score; tasks must push the edge of human cognitive ability (so the average human scores nowhere near the max, and the top 1% of humans don’t max the bench either); and there must be many tasks with a rich permutation space, so it isn’t possible for a model to memorize all permutations.
[2] A heuristic weighted score on this benchmark, intended to measure how “AGI-like” a model is. It might be the RMSE across the benchmark, but with heavy weighting on zero-shot, cross-domain/multimodal tasks. That is, the kind of model that can use information from many different previous tasks on a complex exercise it has never seen before is closer to an AGI, or closer to replicating a “Leonardo da Vinci”, whose exceptional human performance presumably came from exactly that kind of cross-domain knowledge.
[3] In the computer science task set, there are tasks to design an AGI for a bench like this. The model proposes a design, and if that design has already been tested, immediately receives detailed feedback on how it performed.
The “design an AGI” subtask can be much simpler than “write all the boilerplate in Python”, but these models will be able to do that if needed.
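Here is a minimal sketch of what steps [1]-[3] could look like, assuming Python; the field names, the 3x weights, and the `canonicalize` helper are all my own illustrative choices, not part of the proposal above:

```python
from dataclasses import dataclass

@dataclass
class Task:
    """One autogradeable task; `seed` indexes into a rich permutation space."""
    domain: str            # e.g. "math", "vision", "code"
    seed: int              # which permutation of the task template
    zero_shot: bool        # the model has seen no examples of this template
    cross_domain: bool     # solving it requires knowledge from other domains
    human_mean: float      # control-group (average human) score, 0..1
    human_top1pct: float   # top-1% human score, 0..1 (by design, below 1.0)

def agi_score(results: list[tuple[Task, float]]) -> float:
    """Step [2]: weighted RMSE of the gap to top-1% human performance,
    with extra weight on zero-shot and cross-domain tasks. The 3x
    weights are illustrative assumptions."""
    sq_errs, weights = [], []
    for task, score in results:
        w = 3.0 if task.zero_shot else 1.0
        w *= 3.0 if task.cross_domain else 1.0
        gap = task.human_top1pct - score   # negative means superhuman
        sq_errs.append(w * gap * gap)
        weights.append(w)
    rmse = (sum(sq_errs) / sum(weights)) ** 0.5
    return 1.0 - rmse                      # higher is more "AGI-like"

# Step [3]: the design-an-AGI subtask, with instant feedback on repeats.
design_feedback: dict[str, dict] = {}      # canonical design spec -> eval report

def canonicalize(design: str) -> str:
    """Hypothetical normalizer so equivalent specs map to the same key."""
    return " ".join(design.lower().split())

def submit_design(design: str) -> dict | None:
    key = canonicalize(design)
    if key in design_feedback:
        return design_feedback[key]        # already tested: detailed feedback now
    return None                            # novel design: queue a training run
```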
As task scores approach human level across a broad set of tasks, you have an AGI. You would expect it to almost immediately improve to a low superintelligence. As AGIs get used in the real world and fail to perform well at something, you add more tasks to the bench, and/or automate the creation of simulated scenarios that use robotics data.
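Extending the bench from deployment failures could be as simple as wrapping each failure as a new task. A hypothetical helper, reusing the `Task` class from the sketch above:

```python
BENCH: list[Task] = []   # the growing task list (name is an assumption)

def add_failure_task(domain: str, scenario_seed: int,
                     human_mean: float, human_top1pct: float) -> Task:
    """Wrap a real-world failure (or a simulated scenario built from
    robotics data) as a new benchmark task."""
    task = Task(domain=domain, seed=scenario_seed,
                zero_shot=True, cross_domain=True,
                human_mean=human_mean, human_top1pct=human_top1pct)
    BENCH.append(task)
    return task
```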
Why aren’t we already doing this if it’s so simple?
Because each AGI-candidate training run has to be at least twice as large as llama-65b, which means $2M+ in training costs per run. And you need to explore the possibility space pretty broadly, so figure several thousand runs to get to a decent (still far from optimal) AGI design.
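To make the order of magnitude concrete, here’s the arithmetic those figures imply (taking 3,000 as a stand-in for “several thousand”):

```python
cost_per_run = 2e6        # $2M+ per candidate run (at least 2x llama-65b scale)
num_runs = 3_000          # "several thousand" runs to reach a decent design
print(f"${cost_per_run * num_runs:,.0f}")   # -> $6,000,000,000
```

Billions of dollars of compute just to search the design space.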
This is one of the reasons foom cannot happen. At least not without a lot more compute than we have now. Each attempt is too expensive.
Can we refine the above algorithm into something more compute-efficient? Yes, somewhat: go to a modular architecture, where each “AGI candidate” is composed of hundreds of smaller networks and most of them are reused between candidates. But it’s still going to require a lot more compute than llama-65b took to train.
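As a sketch of that modular refinement, assuming candidates are assembled from a shared pool of frozen subnetworks (all names and the module granularity here are illustrative):

```python
from dataclasses import dataclass

@dataclass
class TrainedModule:
    """Stand-in for a small trained subnetwork (weights omitted)."""
    name: str

module_pool: dict[str, TrainedModule] = {}   # shared across all candidates

def train_module(name: str) -> TrainedModule:
    """Placeholder for an actual (expensive) training run of one module."""
    return TrainedModule(name)

def build_candidate(spec: list[str]) -> list[TrainedModule]:
    """Assemble an AGI candidate from many small modules, paying training
    cost only for modules not already in the shared pool."""
    candidate = []
    for name in spec:
        if name not in module_pool:          # novel module: the only new compute
            module_pool[name] = train_module(name)
        candidate.append(module_pool[name])
    return candidate

# Two candidates that differ in one module share most of their training cost.
a = build_candidate(["vision", "planner", "memory", "code"])
b = build_candidate(["vision", "planner", "memory", "math"])  # only "math" is new
```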