Here are two arguments for low-hanging algorithmic improvements.
First, in the past few years I have read many papers containing low-hanging algorithmic improvements. Most such improvements are a few percent or tens of percent. The largest such improvements are things like transformers or mixture of experts, which are substantial steps forward. Such a trend is not guaranteed to persist, but that’s the way to bet.
Second, existing models are far less sample-efficient than humans. We receive about a billion tokens growing to adulthood. The leading LLMs get orders of magnitude more than that. We should be able to do much better. Of course, there’s no guarantee that such an improvement is “low hanging”.
We receive about a billion tokens growing to adulthood. The leading LLMs get orders of magnitude more than that. We should be able to do much better.
Capturing this would probably be a big deal, but a counterpoint is that compute necessary to achieve an autonomous researcher using such sample efficient method might still be very large. Possibly so large that training an LLM with the same compute and current sample-inefficient methods is already sufficient to get a similarly effective autonomous researcher chatbot. In which case there is no effect on timelines. And given that the amount of data is not an imminent constraint on scaling, the possibility of this sample efficiency improvement being useless for the human-led stage of AI development won’t be ruled out for some time yet.
The best method of improving sample efficiency might be more like AlphaZero. The simplest method that’s more likely to be discovered might be more like training on the same data over and over with diminishing returns. Since we are talking low-hanging fruit, I think it’s reasonable that first forays into significantly improved sample efficiency with respect to real data are not yet much better than simply using more unique real data.
I would be genuinely surprised if training a transformer on the pre2014 human Go data over and over would lead it to spontaneously develop alphaZero capacity.
I would expect it to do what it is trained to: emulate / predict as best as possible the distribution of human play.
To some degree I would anticipate the transformer might develop some emergent ability that might make it slightly better than Go-Magnus—as we’ve seen in other cases—but I’d be surprised if this would be unbounded. This is simply not what the training signal is.
We start with an LLM trained on 50T tokens of real data, however capable it ends up being, and ask how to reach the same level of capability with synthetic data. If it takes more than 50T tokens of synthetic data, then it was less valuable per token than real data.
But at the same time, 500T tokens of synthetic data might train an LLM more capable than if trained on the 50T tokens of real data for 10 epochs. In that case, synthetic data helps with scaling capabilities beyond what real data enables, even though it’s still less valuable per token.
With Go, we might just be running into the contingent fact of there not being enough real data to be worth talking about, compared with LLM data for general intelligence. If we run out of real data before some threshold of usefulness, synthetic data becomes crucial (which is the case with Go). It’s unclear if this is the case for general intelligence with LLMs, but if it is, then there won’t be enough compute to improve the situation unless synthetic data also becomes better per token, and not merely mitigates the data bottleneck and enables further improvement given unbounded compute.
I would be genuinely surprised if training a transformer on the pre2014 human Go data over and over would lead it to spontaneously develop alphaZero capacity.
I expect that if we could magically sample much more pre-2014 unique human Go data than was actually generated by actual humans (rather than repeating the limited data we have), from the same platonic source and without changing the level of play, then it would be possible to cheaply tune an LLM trained on it to play superhuman Go.
I don’t know what you mean by ‘general intelligence’ exactly but I suspect you mean something like human+ capability in a broad range of domains.
I agree LLMs will become generally intelligent in this sense when scaled, arguably even are, for domains with sufficient data.
But that’s kind of the sticker right? Cave men didn’t have the whole internet to learn from yet somehow did something that not even you seem to claim LLMs will be able to do: create the (date of the) Internet.
(Your last claim seems surprising. Pre-2014 games don’t have close to the ELO of alphaZero. So a next-token would be trained to simulate a human player up tot 2800, not 3200+. )
Pre-2014 games don’t have close to the ELO of alphaZero. So a next-token would be trained to simulate a human player up to 2800, not 3200+.
Models can be thought of as repositories of features rather than token predictors. A single human player knows some things, but a sufficiently trained model knows all the things that any of the players know. Appropriately tuned, a model might be able to tap into this collective knowledge to a greater degree than any single human player. Once the features are known, tuning and in-context learning that elicit their use are very sample efficient.
This framing seems crucial for expecting LLMs to reach researcher level of capability given a realistic amount of data, since most humans are not researchers, and don’t all specialize in the same problem. The things researcher LLMs would need to succeed in learning are cognitive skills, so that in-context performance gets very good at responding to novel engineering and research agendas only seen in-context (or a certain easier feat that I won’t explicitly elaborate on).
Cave men didn’t have the whole internet to learn from yet somehow did something that not even you seem to claim LLMs will be able to do: create the (date of the) Internet.
Possibly the explanation for the Sapient Paradox, that prehistoric humans managed to spend on the order of 100,000 years without developing civilization, is that they lacked cultural knowledge of crucial general cognitive skills. Sample efficiency of the brain enabled their fixation in language across cultures and generations, once they were eventually distilled, but it took quite a lot of time.
Modern humans and LLMs start with all these skills already available in the data, though humans can more easily learn them. LLMs tuned to tap into all of these skills at the same time might be able to go a long way without an urgent need to distill new ones, merely iterating on novel engineering and scientific challenges, applying the same general cognitive skills over and over.
When I brought up sample inefficiency, I was supporting Mr. Helm-Burger‘s statement that “there’s huge algorithmic gains in …training efficiency (less data, less compute) … waiting to be discovered”. You’re right of course that a reduction in training data will not necessarily reduce the amount of computation needed. But once again, that’s the way to bet.
a reduction in training data will not necessarily reduce the amount of computation needed. But once again, that’s the way to bet
I’m ambivalent on this. If the analogy between improvement of sample efficiency and generation of synthetic data holds, synthetic data seems reasonably likely to be less valuable than real data (per token). In that case we’d be using all the real data we have anyway, which with repetition is sufficient for up to about $100 billion training runs (we are at $100 million right now). Without autonomous agency (not necessarily at researcher level) before that point, there won’t be investment to go over that scale until much later, when hardware improves and the cost goes down.
My answer to that is currently in the form of a detailed 2 hour lecture with a bibliography that has dozens of academic papers in it, which I only present to people that I’m quite confident aren’t going to spread the details. It’s a hard thing to discuss in detail without sharing capabilities thoughts. If I don’t give details or cite sources, then… it’s just, like, my opinion, man. So my unsupported opinion is all I have to offer publicly. If you’d like to bet on it, I’m open to showing my confidence in my opinion by betting that the world turns out how I expect it to.
Why do you think there are these low-hanging algorithmic improvements?
Here are two arguments for low-hanging algorithmic improvements.
First, in the past few years I have read many papers containing low-hanging algorithmic improvements. Most such improvements are a few percent or tens of percent. The largest such improvements are things like transformers or mixture of experts, which are substantial steps forward. Such a trend is not guaranteed to persist, but that’s the way to bet.
Second, existing models are far less sample-efficient than humans. We receive about a billion tokens growing to adulthood. The leading LLMs get orders of magnitude more than that. We should be able to do much better. Of course, there’s no guarantee that such an improvement is “low hanging”.
Capturing this would probably be a big deal, but a counterpoint is that compute necessary to achieve an autonomous researcher using such sample efficient method might still be very large. Possibly so large that training an LLM with the same compute and current sample-inefficient methods is already sufficient to get a similarly effective autonomous researcher chatbot. In which case there is no effect on timelines. And given that the amount of data is not an imminent constraint on scaling, the possibility of this sample efficiency improvement being useless for the human-led stage of AI development won’t be ruled out for some time yet.
Could you train an LLM on pre 2014 Go games that could beat AlphaZero?
I rest my case.
The best method of improving sample efficiency might be more like AlphaZero. The simplest method that’s more likely to be discovered might be more like training on the same data over and over with diminishing returns. Since we are talking low-hanging fruit, I think it’s reasonable that first forays into significantly improved sample efficiency with respect to real data are not yet much better than simply using more unique real data.
I would be genuinely surprised if training a transformer on the pre2014 human Go data over and over would lead it to spontaneously develop alphaZero capacity. I would expect it to do what it is trained to: emulate / predict as best as possible the distribution of human play. To some degree I would anticipate the transformer might develop some emergent ability that might make it slightly better than Go-Magnus—as we’ve seen in other cases—but I’d be surprised if this would be unbounded. This is simply not what the training signal is.
We start with an LLM trained on 50T tokens of real data, however capable it ends up being, and ask how to reach the same level of capability with synthetic data. If it takes more than 50T tokens of synthetic data, then it was less valuable per token than real data.
But at the same time, 500T tokens of synthetic data might train an LLM more capable than if trained on the 50T tokens of real data for 10 epochs. In that case, synthetic data helps with scaling capabilities beyond what real data enables, even though it’s still less valuable per token.
With Go, we might just be running into the contingent fact of there not being enough real data to be worth talking about, compared with LLM data for general intelligence. If we run out of real data before some threshold of usefulness, synthetic data becomes crucial (which is the case with Go). It’s unclear if this is the case for general intelligence with LLMs, but if it is, then there won’t be enough compute to improve the situation unless synthetic data also becomes better per token, and not merely mitigates the data bottleneck and enables further improvement given unbounded compute.
I expect that if we could magically sample much more pre-2014 unique human Go data than was actually generated by actual humans (rather than repeating the limited data we have), from the same platonic source and without changing the level of play, then it would be possible to cheaply tune an LLM trained on it to play superhuman Go.
I don’t know what you mean by ‘general intelligence’ exactly but I suspect you mean something like human+ capability in a broad range of domains. I agree LLMs will become generally intelligent in this sense when scaled, arguably even are, for domains with sufficient data. But that’s kind of the sticker right? Cave men didn’t have the whole internet to learn from yet somehow did something that not even you seem to claim LLMs will be able to do: create the (date of the) Internet.
(Your last claim seems surprising. Pre-2014 games don’t have close to the ELO of alphaZero. So a next-token would be trained to simulate a human player up tot 2800, not 3200+. )
Models can be thought of as repositories of features rather than token predictors. A single human player knows some things, but a sufficiently trained model knows all the things that any of the players know. Appropriately tuned, a model might be able to tap into this collective knowledge to a greater degree than any single human player. Once the features are known, tuning and in-context learning that elicit their use are very sample efficient.
This framing seems crucial for expecting LLMs to reach researcher level of capability given a realistic amount of data, since most humans are not researchers, and don’t all specialize in the same problem. The things researcher LLMs would need to succeed in learning are cognitive skills, so that in-context performance gets very good at responding to novel engineering and research agendas only seen in-context (or a certain easier feat that I won’t explicitly elaborate on).
Possibly the explanation for the Sapient Paradox, that prehistoric humans managed to spend on the order of 100,000 years without developing civilization, is that they lacked cultural knowledge of crucial general cognitive skills. Sample efficiency of the brain enabled their fixation in language across cultures and generations, once they were eventually distilled, but it took quite a lot of time.
Modern humans and LLMs start with all these skills already available in the data, though humans can more easily learn them. LLMs tuned to tap into all of these skills at the same time might be able to go a long way without an urgent need to distill new ones, merely iterating on novel engineering and scientific challenges, applying the same general cognitive skills over and over.
When I brought up sample inefficiency, I was supporting Mr. Helm-Burger‘s statement that “there’s huge algorithmic gains in …training efficiency (less data, less compute) … waiting to be discovered”. You’re right of course that a reduction in training data will not necessarily reduce the amount of computation needed. But once again, that’s the way to bet.
I’m ambivalent on this. If the analogy between improvement of sample efficiency and generation of synthetic data holds, synthetic data seems reasonably likely to be less valuable than real data (per token). In that case we’d be using all the real data we have anyway, which with repetition is sufficient for up to about $100 billion training runs (we are at $100 million right now). Without autonomous agency (not necessarily at researcher level) before that point, there won’t be investment to go over that scale until much later, when hardware improves and the cost goes down.
My answer to that is currently in the form of a detailed 2 hour lecture with a bibliography that has dozens of academic papers in it, which I only present to people that I’m quite confident aren’t going to spread the details. It’s a hard thing to discuss in detail without sharing capabilities thoughts. If I don’t give details or cite sources, then… it’s just, like, my opinion, man. So my unsupported opinion is all I have to offer publicly. If you’d like to bet on it, I’m open to showing my confidence in my opinion by betting that the world turns out how I expect it to.