A valid point. Actually, I’m assuming that that is what is necessary to get an LLM to model a more capable General Intelligence (i.e. to train it on a sufficiently wide range of tasks that it can extrapolate out-of-distribution in a great many directions). In general, what people have been finding seems to be that fine-tuning an LLM on a dataset much smaller than its pre-training set can bring out latent abilities or behaviors that it already had, or add narrow new capabilities, but that making it a whole lot smarter in general requires a dataset comparable in size to the one it was pretrained on. This is a widespread observation, seems very plausible, and would fit with the scaling “laws”, but like much with LLMs it’s not a proven fact. Narrow superintelligence is a much easier problem (see, for example, all the things DeepMind has become famous for over the last few years: far smarter than GPT-4 across a much narrower range of tasks). Ditto for adding tool use to an LLM, such as the things OpenAI and others have been building recently. So yes, I am assuming that going FOOM requires you to keep increasing and broadening your General Intelligence, and that that requires very large token counts. If the first assumption is wrong, and some combination of a finite general intelligence level plus continuing to further scale some smallish set of narrow superintelligence abilities is sufficient to get you all the way to a singularity, then my argument fails.
Implicit in my assumptions here, and probably worth stating, is that if humanity went Butlerian Jihad and never created AGI, and kept our total population below roughly 10 billion, then our technological development would eventually slow to a crawl, and perhaps even top out at some “top of the sigmoid curve” capacity level, limited by our IQ. This is of course an untested assumption. I personally find it fairly plausible: we have a lot of narrow subsubspecialities where the total number of experts in the world is O(10), and technological progress is creating more and more of them. But I could be wrong, and if I am, that could affect my argument.
I’m also assuming that, at any finite intelligence level, neither the “just run more of them” approach nor the “just run them faster” approach can be scaled indefinitely: the first hits resource limits, and the second hits limits set by the speed of light and the distances between atoms.
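As a rough back-of-envelope for the second of those limits (my own illustrative numbers, not part of the original claim): if signals can travel no faster than light and components can be no smaller than atoms, a serial processor’s switching rate is bounded by something like

$$f_{\max} \sim \frac{c}{d_{\text{atom}}} \approx \frac{3\times 10^{8}\ \text{m/s}}{3\times 10^{-10}\ \text{m}} \approx 10^{18}\ \text{Hz},$$

so “just run them faster” buys a large but still finite factor over today’s roughly $10^{9}$ Hz clocks.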
Once non-superintelligent AGI can build domain-specific narrow superintelligences, it can generate synthetic data that seamlessly integrates their capabilities into its general intelligence (possibly as modalities) without requiring general superintelligence to generate that data, circumventing the projections from LLM-only growth. In particular, related to what ChristianKl talks about in the other reply, formal proof seems like an important case of this construction, potentially allowing LLMs to suddenly understand textbooks and papers that their general intelligence alone wouldn’t be sufficient to figure out, opening the way to build on that understanding while staying anchored to the capabilities of the narrow formal-proof superintelligence (built, in this case, by humans).
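For concreteness, here is a toy illustration (mine, not the original commenter’s) of what such a machine-checkable artifact looks like: a trivial Lean 4 theorem that a proof checker verifies mechanically, regardless of whether a human, an LLM, or a narrow proof-search system produced it.

```lean
-- Toy example: the kernel checks this mechanically, so a verified proof can be
-- trusted as synthetic training data even if no human ever reads it.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```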
> I’m also assuming that, at any finite intelligence level, neither the “just run more of them” approach nor the “just run them faster” approach can be scaled indefinitely: the first hits resource limits, and the second hits limits set by the speed of light and the distances between atoms.
The point of “just run them faster” is that it circumvents projections based on any particular AGI architecture, because it allows discovering alternative architectures from the distant future within months. At that point it’s no longer “just run them faster”, but something much closer to whatever is possible in principle. And because of the contribution of the “just run them faster” phase, this doesn’t take decades or centuries: singularity-grade change comes from both the “just run them faster” phase and the subsequent phase that exploits its discoveries, both of which take very little time on a human scale.
> In general, what people have been finding seems to be that fine-tuning an LLM on a dataset much smaller than its pre-training set can bring out latent abilities or behaviors that it already had, or add narrow new capabilities, but that making it a whole lot smarter in general requires a dataset comparable in size to the one it was pretrained on.
Yes, you do need a lot of data.
There are a lot of domains where it’s possible to distinguish good answers from bad answers by looking at results.
For a lot of mathematical problems, it’s relatively easy to check whether a proof is correct but hard to write the proof in the first place.
Once you have an AutoGPT-like agent that can do mathematical proofs, you have a lot of room to generate data about mathematical proofs, and you can optimize for the AutoGPT instance being able to create proofs with fewer steps of running the LLM.
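A minimal sketch of that loop, with hypothetical stand-ins `propose_proof` (the AutoGPT-like agent driving the LLM) and `check_proof` (a formal verifier such as a Lean or Coq kernel), neither of which names a real API; the point is only that the cheap, reliable checker supplies the training signal, and that “fewest LLM steps among verified proofs” is the quantity being optimized:

```python
import random
from dataclasses import dataclass


@dataclass
class Attempt:
    statement: str
    proof: str
    llm_steps: int  # number of LLM calls the agent needed


def propose_proof(statement: str, budget: int) -> Attempt:
    """Hypothetical stand-in for an AutoGPT-like agent driving an LLM."""
    steps = random.randint(1, budget)
    return Attempt(statement, f"<candidate proof found in {steps} steps>", steps)


def check_proof(attempt: Attempt) -> bool:
    """Hypothetical stand-in for a formal proof checker: cheap and reliable."""
    return random.random() < 0.3  # placeholder acceptance rate


def generate_training_data(statements, tries=8, budget=32):
    """Keep, for each statement, the verified proof that used the fewest LLM steps.

    The verified (statement, proof) pairs become synthetic training data, and the
    step count is the quantity to optimize the agent toward.
    """
    dataset = []
    for statement in statements:
        attempts = [propose_proof(statement, budget) for _ in range(tries)]
        verified = [a for a in attempts if check_proof(a)]
        if verified:
            dataset.append(min(verified, key=lambda a: a.llm_steps))
    return dataset


if __name__ == "__main__":
    problems = ["forall n, n + 0 = n", "sqrt(2) is irrational"]
    for item in generate_training_data(problems):
        print(f"{item.statement!r}: best verified proof used {item.llm_steps} LLM steps")
```

In a real system both stubs would be replaced by an actual agent and an actual checker; nothing else in the loop needs human labeling.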
Using the set of prompts that ChatGPT users provided, the agent can also look through the data and find the individual problems for which it’s easy to produce problem sets and grade the quality of the answers.