When reading posts about AI development, I get the impression that many people follow a model where the important variables are the data that's out there in the world, the available compute for model training, and the available training algorithms.
I think this underrates the importance of synthetic training data generation.
AlphaStar trained largely on self-generated data to become much better than human players.
There's an observation that you can't improve a standard LLM much by retraining it on random pieces of its own output.
I think there's a good chance that training on the output of models that can reason, like o1 and o3, does allow for improvement.
Just as AlphaStar could generate the training data it needed to become superhuman on its own, it's possible that models like o3 can do the same simply by having compute thrown at them.
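To make that concrete, here is a toy sketch of what "making up the necessary training data" looks like. Everything in it is made up for illustration (a Nim-like race-to-10 game and a tabular win-rate policy, nothing like AlphaStar's actual pipeline), but the shape is the same: the model's own games become the training data, and throwing more compute at it just means playing more games.

```python
import random
from collections import defaultdict

TARGET = 10          # first player to bring the running total to 10 or more wins
ACTIONS = (1, 2)     # on each turn you add 1 or 2

# Tabular "policy": empirical win counts per (total, action). This stands in for
# the neural network; the point is where the data comes from, not how it's fit.
wins = defaultdict(lambda: 1.0)
plays = defaultdict(lambda: 2.0)

def choose(total):
    # Mostly pick the action with the best observed win rate, sometimes explore.
    if random.random() < 0.1:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: wins[(total, a)] / plays[(total, a)])

def self_play_episode():
    """One game of the current policy against itself; returns (total, action, won) per move."""
    history, total, player = [], 0, 0
    while total < TARGET:
        action = choose(total)
        history.append((player, total, action))
        total += action
        winner = player          # whoever moved last; the final value is the actual winner
        player = 1 - player
    return [(t, a, p == winner) for (p, t, a) in history]

# "Throwing compute at it": every game played is new synthetic training data.
for _ in range(20_000):
    for total, action, won in self_play_episode():
        plays[(total, action)] += 1
        if won:
            wins[(total, action)] += 1

print({t: max(ACTIONS, key=lambda a: wins[(t, a)] / plays[(t, a)]) for t in range(TARGET)})
```

With enough episodes the greedy policy tends to rediscover the known winning strategy (land the running total on 1, 4 or 7) purely from data it generated itself; no human games are involved anywhere.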
I don’t understand these things much, but it seems to me that we have two different kinds of learning here.
In chess, the chessboard is the territory, and the AI has direct access to the territory (it can represent the chessboard in its memory). This is what allows the AI to run experiments and learn from them. It can discover new things by trying random changes and observing the outcome.
With LLMs, the territory is "out there", and the AI only has access to human maps of the territory. It can surpass humans by synthesizing millions of human maps and connecting the dots, but at the end of the day, they are just maps. This means two problems:
gaps in the knowledge: possible experiments that no one ever did—and the AI can’t do them either
noise: if some information is wrong, AI can’t check it, other than comparing to other (unreliable) maps
Feeding the AI its own output doesn't address either of these problems.
With o1 and o3, we found out that you get better output if you spend more compute on getting the answer. Letting an LLM work on an answer for 10 minutes via chain of thought gives you better answers than letting it work on it for 2 seconds.
By training on its own answers, you could get the LLM to give, in 2 seconds and without chain of thought, the kind of answers it currently needs 10 minutes' worth of chain-of-thought compute to produce.
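A sketch of what that pipeline could look like, with the caveat that every function here is a placeholder for whatever model, checker and trainer you actually have; I am not claiming this is what OpenAI does:

```python
# Hypothetical sketch of "distill your own chain of thought": sample slow, long
# chain-of-thought answers, keep only the ones an automatic checker accepts, and
# fine-tune on (question -> final answer) so the model learns to produce them directly.

def slow_cot_answer(model, question, thinking_budget=10_000):
    """Stand-in for 'let the model think for 10 minutes': returns (chain_of_thought, final_answer)."""
    raise NotImplementedError

def is_correct(question, answer):
    """Stand-in for an automatic checker: unit tests, a proof checker, an exact-match grader."""
    raise NotImplementedError

def finetune(model, pairs):
    """Stand-in for ordinary supervised fine-tuning on (prompt, target) pairs."""
    raise NotImplementedError

def build_distillation_set(model, questions):
    pairs = []
    for q in questions:
        _chain, answer = slow_cot_answer(model, q)
        if is_correct(q, answer):          # keep only answers that check out
            pairs.append((q, answer))      # the chain of thought is deliberately dropped:
    return pairs                           # the target is "this answer, directly"

# One round of self-improvement: slow, verified answers become fast reflexes.
# model = finetune(model, build_distillation_set(model, questions))
```

Whether this transfers the reasoning or just the answers is exactly the open question, but unlike feeding a standard LLM random pieces of its own output, the data here is filtered by a check that is external to the model.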
The difference between a person with an IQ of 90 and one with an IQ of 180 is not in gaps of knowledge or having access to information that’s right or wrong but in reasoning ability.
As far as the territory of LLMs being "out there" goes, that's true only for a subset of the territory. The field of mathematics isn't "out there". You don't need to run any experiments in physical reality to validate claims about math and to see whether you are getting better at creating math proofs.
When it comes to computer programming, some tasks require interaction with physical reality, but plenty of tasks are "write a patch that adds a piece of functionality, passes all the unit tests, and scores well on some clean-code metric". You also have "find an input that makes a given function crash" and a few other tasks where the outcome of what the LLM is doing can be evaluated automatically, without any human input.
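The "find an input that makes it crash" case is the easiest to make concrete: even a dumb random fuzzer gives a fully automatic success signal. `buggy_parse` below is a made-up target, and in the setup I'm gesturing at, the LLM would be proposing the inputs instead of `random`:

```python
import random
import string

def buggy_parse(s: str) -> int:
    """Made-up target: raises ValueError unless s contains exactly one ':'."""
    key, value = s.split(":")
    return len(key) + len(value)

def fuzz(target, trials=10_000, max_len=8):
    """Try random strings until one makes the target raise; no human judgment involved."""
    alphabet = string.ascii_lowercase + ":"
    for _ in range(trials):
        candidate = "".join(random.choice(alphabet) for _ in range(random.randint(0, max_len)))
        try:
            target(candidate)
        except Exception as exc:
            return candidate, exc    # automatic success signal: a crashing input was found
    return None, None

print(fuzz(buggy_parse))
```

The "write a patch that passes all the unit tests" tasks work the same way, with the test suite standing in as the judge.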
The difference between a person with an IQ of 90 and one with an IQ of 180 is not in gaps of knowledge or having access to information that’s right or wrong but in reasoning ability.
There are enormous differences between an IQ 90 and an IQ 180 person in gaps of knowledge, and further, in self-selection into niches and lifestyles, and in skills like looking up information in Google Scholar. Look at vocab norms or simple, ordinary trivia questions such as 'does the earth go around the sun or vice-versa?'. You can't do much reasoning about what you don't know about.
(This is one of the biggest reasons that ‘retrieval heavy’ LLMs have underperformed so much. There is no ‘small logical core’ you can easily cheaply learn free of actual real-world knowledge. At best, like the Phi series, you can steal reasoning from a much larger model like GPT-4 which learned it the hard way and has predigested it for you.)