I’ve long been a skeptic of scaling LLMs to AGI*. I fundamentally don’t understand how this is even possible. It must be said that very smart people give this view credence: davidad, dmurfet. On the other side are Vanessa Kosoy and Steven Byrnes. When pushed, proponents don’t actually defend the position that a large enough transformer will create nanotech or even obsolete their own jobs. They usually mumble something about scaffolding.
I won’t get into this debate here, but I do want to note that my timelines have lengthened, primarily because some of the never-clearly-stated but heavily implied AI developments predicted by proponents of very short timelines have not materialized. To be clear, it has only been a year since GPT-4 was released, and GPT-5 is around the corner, so perhaps my hope is premature. Still, my timelines are lengthening.
A year ago, when GPT-4 came out, progress was blindingly fast. Part of short timelines came from a sense of ‘if we got surprised so hard by GPT-2 and GPT-3, we are completely uncalibrated; who knows what comes next’.
People seemed surprised by GPT-4 in a way that seemed uncalibrated to me. GPT-4’s performance was basically in line with what one would expect if the scaling laws continued to hold. At the time it was already clear that the only really important drivers were compute and data, and that we would run out of both shortly after GPT-4. Scaling proponents suggested this was only the beginning, that there was a whole host of innovation coming. Whispers of mesa-optimizers and simulators.
One year in: chain-of-thought doesn’t actually improve things that much. External memory and very long context lengths, ditto. A whole list of proposed architectures seems to serve solely as a paper mill. Every month there is new hype about the latest LLM or image model, yet they never deviate from expectations based on simple extrapolation of the scaling laws. Only one thing really seems to matter, and that is compute and data. We have about 3 more OOMs of compute to go. Data may be milked for another OOM.
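For concreteness, “simple extrapolation of the scaling laws” here means something like plugging larger compute budgets into a fitted loss curve. A minimal sketch below uses the Chinchilla-style form L(N, D) = E + A/N^α + B/D^β with the coefficients reported by Hoffmann et al. (2022); the parameter and token counts are assumed round numbers, not figures for any actual model:

```python
# Chinchilla-style scaling law: predicted loss as a function of
# parameters N and training tokens D (coefficients from Hoffmann et al. 2022).
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**alpha + B / n_tokens**beta

# Assumed round numbers for a current frontier-scale run.
base = loss(7e10, 1.4e12)

# 3 more OOMs of compute (C ~ 6*N*D), split evenly between N and D:
# each grows by sqrt(1000) ~ 31.6x.
extrapolated = loss(7e10 * 31.6, 1.4e12 * 31.6)

print(f"base loss ~{base:.2f}, +3 OOMs of compute ~{extrapolated:.2f}")
```

The point of the extrapolation view is that the curve keeps bending down smoothly toward the irreducible term E; nothing in it predicts a discontinuity.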
A big question is whether GPT-5 will suddenly make AgentGPT work (and to what degree). It would seem that GPT-4 is in many ways far more capable than (most or all) humans, yet AgentGPT is curiously bad.
All in all, AI progress** is developing according to the naive extrapolations of scaling laws, but nothing beyond that. The breathless Twitter hype about new models is still there, but it seems to be believed at a simulacrum level higher than I can parse.
Does this mean we’ll hit an AI winter? No. In my model there may be only one remaining roadblock to ASI (and I suspect I know what it is). That innovation could come at any time. I don’t know how hard it is, but I suspect it is not too hard.
* the term AGI seems to denote vastly different things to different people in a way I find deeply confusing. I notice that the thing that I thought everybody meant by AGI is now being called ASI. So when I write AGI, feel free to substitute ASI.
** or better, AI congress
Addendum: since I’ve been quoted in dmurfet’s AXRP interview as believing that there are certain kinds of reasoning that cannot be represented by transformers/LLMs, I want to be clear that this is not really an accurate portrayal of my beliefs. E.g., I don’t think that transformers don’t truly understand, that they are just stochastic parrots, or that they otherwise can’t engage in the abstract reasoning that humans do. I think that is clearly false, as seen by interacting with any frontier model.
With scale, there is visible improvement in the difficulty of ideas and details that are novel to the chatbot but possible to explain to it in-context, things like issues with the code it’s writing. If a chatbot is below some threshold of situational awareness of a task, no scaffolding can keep it on track, but for a better chatbot trivial scaffolding might suffice. Many people can’t google for a solution to a technical issue; the difference between them and those who can is often subtle.
So a modest amount of scaling alone seems plausibly sufficient for making chatbots that can do whole jobs almost autonomously. If this works, 1-2 OOMs more of scaling becomes both economically feasible and more likely to be worthwhile. LLMs think much faster than humans, so they only need to be barely smart enough to help with clearing those remaining roadblocks.
At this moment in time, it seems scaffolding tricks haven’t really improved the baseline performance of models that much. Overwhelmingly, the capability comes down to whether the RLHFed base model can do the task.
it seems scaffolding tricks haven’t really improved the baseline performance of models that much. Overwhelmingly, the capability comes down to whether the RLHFed base model can do the task.
That’s what I’m also saying above (in case you are stating what you see as a point of disagreement). This is consistent with scaling-only short timeline expectations. The crux for this model is current chatbots being already close to autonomous agency and to becoming barely smart enough to help with AI research. Not them directly reaching superintelligence or having any more room for scaling.
What I don’t get about this position:
If it was indeed just scaling, what’s AI research for? There is nothing to discover, just scale more compute. Sure, you can maybe improve the speed of deploying compute a little, but at its core it seems like a story that’s in conflict with itself.
My view is that there are huge algorithmic gains in peak capability, training efficiency (less data, less compute), and inference efficiency waiting to be discovered, available to be found by a large number of parallel research hours invested by a minimally competent multimodal-LLM-powered research team. So it’s not that scaling leads to ASI directly; it’s:
Scaling leads to brute-forcing the LLM agent across the threshold of AI research usefulness.
Using these LLM agents in a large research project can lead to rapidly finding better ML algorithms and architectures.
Training these newly discovered architectures at large scales leads to much more competent automated researchers.
This process repeats quickly over a few months or years.
This process results in AGI.
AGI, if instructed (or allowed, if it’s agentically motivated on its own to do so) to improve itself will find even better architectures and algorithms.
This process can repeat until ASI. The resulting intelligence / capability / inference speed goes far beyond that of humans.
Note that this process isn’t inevitable, there are many points along the way where humans can (and should, in my opinion) intervene. We aren’t disempowered until near the end of this.
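The loop in the steps above can be caricatured as a toy iteration. Every number here is invented purely to show the compounding structure being claimed, not a forecast:

```python
# Toy model of the scale -> discover -> retrain loop described above.
capability = 1.0           # invented units; 1.0 = threshold of AI-research usefulness
gain_per_generation = 1.5  # invented: each generation's discoveries multiply capability

generations = 0
while capability < 10.0:   # invented threshold standing in for "AGI-level"
    capability *= gain_per_generation
    generations += 1

print(generations)  # prints 6: generations of the loop needed in this toy model
```

The structural point is only that multiplicative gains compound, so even modest per-generation improvements cross a fixed threshold in few iterations.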
Here are two arguments for low-hanging algorithmic improvements.
First, in the past few years I have read many papers containing low-hanging algorithmic improvements. Most such improvements are a few percent or tens of percent. The largest such improvements are things like transformers or mixture of experts, which are substantial steps forward. Such a trend is not guaranteed to persist, but that’s the way to bet.
Second, existing models are far less sample-efficient than humans. We receive about a billion tokens growing to adulthood. The leading LLMs get orders of magnitude more than that. We should be able to do much better. Of course, there’s no guarantee that such an improvement is “low hanging”.
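The sample-efficiency gap is easy to put rough numbers on. The LLM token count below is an assumed round figure, not a quoted statistic:

```python
import math

human_tokens = 1e9    # ~a billion tokens of language input growing to adulthood
llm_tokens = 1.5e13   # assumed round figure for a frontier pretraining corpus

gap_ooms = math.log10(llm_tokens / human_tokens)
print(f"frontier LLMs see ~{gap_ooms:.1f} more orders of magnitude of tokens")
```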
We receive about a billion tokens growing to adulthood. The leading LLMs get orders of magnitude more than that. We should be able to do much better.
Capturing this would probably be a big deal, but a counterpoint is that the compute necessary to achieve an autonomous researcher using such a sample-efficient method might still be very large. Possibly so large that training an LLM with the same compute and current sample-inefficient methods is already sufficient to get a similarly effective autonomous researcher chatbot. In which case there is no effect on timelines. And given that the amount of data is not an imminent constraint on scaling, the possibility that this sample-efficiency improvement is useless for the human-led stage of AI development won’t be ruled out for some time yet.
The best method of improving sample efficiency might be more like AlphaZero. The simplest method that’s more likely to be discovered might be more like training on the same data over and over with diminishing returns. Since we are talking low-hanging fruit, I think it’s reasonable that first forays into significantly improved sample efficiency with respect to real data are not yet much better than simply using more unique real data.
I would be genuinely surprised if training a transformer on the pre-2014 human Go data over and over would lead it to spontaneously develop AlphaZero capacity.
I would expect it to do what it is trained to: emulate / predict as best as possible the distribution of human play.
To some degree I would anticipate the transformer might develop some emergent ability that makes it slightly better than Go-Magnus, as we’ve seen in other cases, but I’d be surprised if this were unbounded. That is simply not what the training signal is.
We start with an LLM trained on 50T tokens of real data, however capable it ends up being, and ask how to reach the same level of capability with synthetic data. If it takes more than 50T tokens of synthetic data, then it was less valuable per token than real data.
But at the same time, 500T tokens of synthetic data might train an LLM more capable than if trained on the 50T tokens of real data for 10 epochs. In that case, synthetic data helps with scaling capabilities beyond what real data enables, even though it’s still less valuable per token.
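The “value per token” comparison in the last two paragraphs can be written out explicitly. The synthetic-token figure here is hypothetical, chosen only to satisfy the “more than 50T” case described above:

```python
# Definition used above: synthetic data is less valuable per token than real data
# if reaching the same capability requires more synthetic tokens than real tokens.
real_tokens_for_capability = 50e12        # 50T real tokens (from the example)
synthetic_tokens_for_capability = 120e12  # hypothetical: >50T, hence less valuable

relative_value_per_token = real_tokens_for_capability / synthetic_tokens_for_capability
print(relative_value_per_token < 1.0)  # prints True: worth less per token here
```

Note the two claims are compatible: each synthetic token can be worth less than a real one while an unbounded supply of synthetic tokens still reaches capabilities a fixed real corpus cannot.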
With Go, we might just be running into the contingent fact of there not being enough real data to be worth talking about, compared with LLM data for general intelligence. If we run out of real data before some threshold of usefulness, synthetic data becomes crucial (which is the case with Go). It’s unclear if this is the case for general intelligence with LLMs, but if it is, then there won’t be enough compute to improve the situation unless synthetic data also becomes better per token, and not merely mitigates the data bottleneck and enables further improvement given unbounded compute.
I would be genuinely surprised if training a transformer on the pre-2014 human Go data over and over would lead it to spontaneously develop AlphaZero capacity.
I expect that if we could magically sample much more pre-2014 unique human Go data than was actually generated by actual humans (rather than repeating the limited data we have), from the same platonic source and without changing the level of play, then it would be possible to cheaply tune an LLM trained on it to play superhuman Go.
I don’t know what you mean by ‘general intelligence’ exactly but I suspect you mean something like human+ capability in a broad range of domains.
I agree LLMs will become generally intelligent in this sense when scaled, arguably even are, for domains with sufficient data.
But that’s kind of the kicker, right? Cavemen didn’t have the whole Internet to learn from, yet somehow did something that not even you seem to claim LLMs will be able to do: create the (data of the) Internet.
(Your last claim seems surprising. Pre-2014 games don’t have close to the Elo of AlphaZero. So a next-token predictor would be trained to simulate a human player up to 2800, not 3200+.)
Pre-2014 games don’t have close to the Elo of AlphaZero. So a next-token predictor would be trained to simulate a human player up to 2800, not 3200+.
Models can be thought of as repositories of features rather than token predictors. A single human player knows some things, but a sufficiently trained model knows all the things that any of the players know. Appropriately tuned, a model might be able to tap into this collective knowledge to a greater degree than any single human player. Once the features are known, tuning and in-context learning that elicit their use are very sample efficient.
This framing seems crucial for expecting LLMs to reach researcher level of capability given a realistic amount of data, since most humans are not researchers, and don’t all specialize in the same problem. The things researcher LLMs would need to succeed in learning are cognitive skills, so that in-context performance gets very good at responding to novel engineering and research agendas only seen in-context (or a certain easier feat that I won’t explicitly elaborate on).
Cavemen didn’t have the whole Internet to learn from, yet somehow did something that not even you seem to claim LLMs will be able to do: create the (data of the) Internet.
Possibly the explanation for the Sapient Paradox, that prehistoric humans managed to spend on the order of 100,000 years without developing civilization, is that they lacked cultural knowledge of crucial general cognitive skills. The sample efficiency of the brain enabled these skills to become fixed in language across cultures and generations once they were eventually distilled, but that took quite a lot of time.
Modern humans and LLMs start with all these skills already available in the data, though humans can more easily learn them. LLMs tuned to tap into all of these skills at the same time might be able to go a long way without an urgent need to distill new ones, merely iterating on novel engineering and scientific challenges, applying the same general cognitive skills over and over.
When I brought up sample inefficiency, I was supporting Mr. Helm-Burger‘s statement that “there’s huge algorithmic gains in …training efficiency (less data, less compute) … waiting to be discovered”. You’re right of course that a reduction in training data will not necessarily reduce the amount of computation needed. But once again, that’s the way to bet.
a reduction in training data will not necessarily reduce the amount of computation needed. But once again, that’s the way to bet
I’m ambivalent on this. If the analogy between improvement of sample efficiency and generation of synthetic data holds, synthetic data seems reasonably likely to be less valuable than real data (per token). In that case we’d be using all the real data we have anyway, which with repetition is sufficient for up to about $100 billion training runs (we are at $100 million right now). Without autonomous agency (not necessarily at researcher level) before that point, there won’t be investment to go over that scale until much later, when hardware improves and the cost goes down.
My answer to that is currently in the form of a detailed 2 hour lecture with a bibliography that has dozens of academic papers in it, which I only present to people that I’m quite confident aren’t going to spread the details. It’s a hard thing to discuss in detail without sharing capabilities thoughts. If I don’t give details or cite sources, then… it’s just, like, my opinion, man. So my unsupported opinion is all I have to offer publicly. If you’d like to bet on it, I’m open to showing my confidence in my opinion by betting that the world turns out how I expect it to.
The story involves phase changes. Just scaling is what’s likely to be available to human developers in the short term (a few years), it’s not enough for superintelligence. Autonomous agency secures funding for a bit more scaling. If this proves sufficient to get smart autonomous chatbots, they then provide speed to very quickly reach the more elusive AI research needed for superintelligence.
It’s not a little speed, it’s a lot of speed, serial speedup of about 100x plus running in parallel. This is not as visible today, because current chatbots are not capable of doing useful work with serial depth, so the serial speedup is not in practice distinct from throughput and cost. But with actually useful chatbots it turns decades to years, software and theory from distant future become quickly available, non-software projects get to be designed in perfect detail faster than they can be assembled.
In my mainline model there are only a few innovations needed, perhaps only a single big one, to produce an AGI that, just as the Turing machine sits at the top of the Chomsky hierarchy, will be basically the optimal architecture given resource constraints. There are probably some minor improvements to do with bridging the gap between the theoretically optimal architecture and the actual architecture, or parts of the algorithm that can be indefinitely improved but with diminishing returns (these probably exist due to Levin, and matrix multiplication is possibly one of them). On the whole I expect AI research to be very chunky.
Indeed, we’ve seen that there was really just one big idea behind all current AI progress: scaling, specifically scaling GPUs on maximally large undifferentiated datasets. There were some minor technical innovations needed to pull this off, but on the whole that was the clincher.
Of course, I don’t know. Nobody knows. But I find this the most plausible guess based on what we know about intelligence, learning, theoretical computer science and science in general.
(Re: the Difficult to Parse react on the other comment.
I was confused about the relevance of your comment above on chunky innovations; it seems to be making some point (for which what it actually says is an argument), but I can’t figure out what it is. One clue was that it seems like you might be talking about innovations needed for superintelligence, while I was previously talking about the possible absence of need for further innovations to reach autonomous researcher chatbots, an easier target. So I replied by formulating this distinction and some thoughts on the impact of and conditions for reaching innovations of both kinds. Possibly the relevance of this was confusing in turn.)
There are two kinds of relevant hypothetical innovations: those that enable chatbot-led autonomous research, and those that enable superintelligence. It’s plausible that there is no need for (more of) the former, so that mere scaling through human efforts will lead to such chatbots in a few years regardless. (I think it’s essentially inevitable that there is currently enough compute that with appropriate innovations we can get such autonomous human-scale-genius chatbots, but it’s unclear if these innovations are necessary or easy to discover.) If autonomous chatbots are still anything like current LLMs, they are very fast compared to humans, so they quickly discover remaining major innovations of both kinds.
In principle, even if innovations that enable superintelligence (at a scale feasible with human efforts in a few years) don’t exist at all, extremely fast autonomous research and engineering still lead to superintelligence, because they greatly accelerate scaling. Physical infrastructure might start scaling really fast using pathways like macroscopic biotech, even if Drexlerian nanotech is too hard without superintelligence or impossible in principle. Drosophila biomass doubles every 2 days; small things can assemble into large things.
Wasn’t the surprising thing about GPT-4 that scaling laws did hold? Before this many people expected scaling laws to stop before such a high level of capabilities. It doesn’t seem that crazy to think that a few more OOMs could be enough for greater than human intelligence. I’m not sure that many people predicted that we would have much faster than scaling law progress (at least until ~human intelligence AI can speed up research)? I think scaling laws are the extreme rate of progress which many people with short timelines worry about.
To some degree, yes, they were not guaranteed to hold. But by that point they had held for over 10 OOMs, IIRC, and there was no known reason they couldn’t continue.
This might be the particular twitter bubble I was in but people definitely predicted capabilities beyond simple extrapolation of scaling laws.
When pushed proponents don’t actually defend the position that a large enough transformer will create nanotech
Can you expand on what you mean by “create nanotech?” If improvements to our current photolithography techniques count, I would not be surprised if (scaffolded) LLMs could be useful for that. Likewise for getting bacteria to express polypeptide catalysts for useful reactions, and even maybe figure out how to chain several novel catalysts together to produce something useful (again, referring to scaffolded LLMs with access to tools).
If you mean that LLMs won’t be able to bootstrap from our current “nanotech only exists in biological systems and chip fabs” world to Drexler-style nanofactories, I agree with that, but I expect things will get crazy enough that I can’t predict them long before nanofactories are a thing (if they ever are).
or even obsolete their job
Likewise, I don’t think LLMs can immediately obsolete all of the parts of my job. But they sure do make parts of my job a lot easier. If you have 100 workers that each spend 90% of their time on one specific task, and you automate that task, that’s approximately as useful as fully automating the jobs of 90 workers. “Human-equivalent” is one of those really leaky abstractions—I would be pretty surprised if the world had any significant resemblance to the world of today by the time robotic systems approached the dexterity and sensitivity of human hands for all of the tasks we use our hands for, whereas for the task of “lift heavy stuff” or “go really fast” machines left us in the dust long ago.
Iterative improvements on the timescale we’re likely to see are still likely to be pretty crazy by historical standards. But yeah, if your timelines were “end of the world by 2026” I can see why they’d be lengthening now.
My timelines were not 2026. In fact, I made bets against doomers 2-3 years ago, one will resolve by next year.
I agree iterative improvements are significant. This falls under “naive extrapolation of scaling laws”.
By nanotech I mean something akin to Drexlerian nanotech, or something similarly transformative in the vicinity. I think it is plausible that a true ASI would be able to make rapid progress on nanotech (perhaps on the order of a few years or a decade).
I suspect that people who don’t take this as a serious possibility haven’t really thought through what AGI/ASI means, or what the limits and drivers of science and tech really are; I suspect they are simply falling prey to status quo bias.
I don’t recall what I said in the interview about your beliefs, but what I meant to say was something like what you just said in this post, apologies for missing the mark.
State-of-the-art models such as Gemini aren’t LLMs anymore. They are natively multimodal or omni-modal transformer models that can process text, images, speech and video. These models seem to me like a huge jump in capabilities over text-only LLMs like GPT-3.
Chain-of-thought prompting makes models much more capable. In the original paper “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”, PaLM 540B with standard prompting only solves 18% of problems but 57% of problems with chain-of-thought prompting.
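For concreteness, the intervention in that paper is purely a change to the few-shot prompt; a minimal sketch (the exemplar wording here is illustrative, not copied from the paper):

```python
problem = ("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
           "Each can has 3 tennis balls. How many tennis balls does he have now?")

# Standard prompting: the model is asked for the answer directly.
standard_prompt = f"Q: {problem}\nA:"

# Chain-of-thought prompting: the exemplar includes intermediate reasoning,
# which induces the model to reason step by step before its final answer.
cot_exemplar = (
    "Q: There are 3 cars and each car has 4 wheels. How many wheels in total?\n"
    "A: Each car has 4 wheels. 3 cars have 3 * 4 = 12 wheels. The answer is 12.\n\n"
)
cot_prompt = cot_exemplar + f"Q: {problem}\nA:"
```

Both prompts end at the same “A:” cue; only the exemplar’s worked reasoning differs, which is what the 18% vs. 57% comparison isolates.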
I expect the use of agent features such as reflection will lead to similar large increases in capabilities as well in the near future.
I just asked GPT-4 a GSM8K problem and I agree with your point. I think what’s happening is that GPT-4 has been fine-tuned to respond with chain-of-thought reasoning by default, so it’s no longer necessary to explicitly ask it to reason step by step. Though if you ask it to “respond with just a single number” to eliminate the chain-of-thought reasoning, its problem-solving ability is much worse.
So a modest amount of scaling alone seems plausibly sufficient for making chatbots that can do whole jobs almost autonomously.
You may be right. I don’t know, of course.
The crux for this model is current chatbots being already close to autonomous agency and to becoming barely smart enough to help with AI research.
Yes, agreed.
My view is that there are huge algorithmic gains in peak capability, training efficiency, and inference efficiency waiting to be discovered
Why do you think there are these low-hanging algorithmic improvements?
Could you train an LLM on pre-2014 Go games that could beat AlphaZero?
I rest my case.
The best method of improving sample efficiency might be more like AlphaZero. The simplest method that’s more likely to be discovered might be more like training on the same data over and over with diminishing returns. Since we are talking low-hanging fruit, I think it’s reasonable that first forays into significantly improved sample efficiency with respect to real data are not yet much better than simply using more unique real data.
I would be genuinely surprised if training a transformer on pre-2014 human Go data over and over would lead it to spontaneously develop AlphaZero-level capability. I would expect it to do what it is trained to do: emulate/predict as best as possible the distribution of human play. To some degree I would anticipate the transformer might develop some emergent ability that makes it slightly better than Go-Magnus, as we’ve seen in other cases, but I’d be surprised if this were unbounded. That is simply not what the training signal rewards.
We start with an LLM trained on 50T tokens of real data, however capable it ends up being, and ask how to reach the same level of capability with synthetic data. If it takes more than 50T tokens of synthetic data, then it was less valuable per token than real data.
But at the same time, 500T tokens of synthetic data might train an LLM more capable than if trained on the 50T tokens of real data for 10 epochs. In that case, synthetic data helps with scaling capabilities beyond what real data enables, even though it’s still less valuable per token.
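A toy model of this per-token-value point, with an invented logarithmic capability curve and a made-up discount factor for synthetic tokens (none of these constants come from the discussion; they only illustrate the shape of the argument):

```python
import math

def capability(real_tokens, synth_tokens, synth_value=0.3):
    """Toy diminishing-returns curve: capability grows with the log of
    'effective' tokens; synth_value < 1 encodes synthetic data being worth
    less per token. All constants are invented for illustration."""
    return math.log10(real_tokens + synth_value * synth_tokens)

real_only = capability(50e12, 0)        # 50T real tokens
with_synth = capability(50e12, 500e12)  # plus 500T cheaper synthetic tokens

# Each synthetic token is worth less than a real one, yet in bulk they
# still push capability past what the real data alone allowed.
print(real_only, with_synth)
```

The point of the sketch is only that “less valuable per token” and “extends capability past the real-data ceiling” are compatible claims.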
With Go, we might just be running into the contingent fact of there not being enough real data to be worth talking about, compared with LLM data for general intelligence. If we run out of real data before some threshold of usefulness, synthetic data becomes crucial (which is the case with Go). It’s unclear if this is the case for general intelligence with LLMs, but if it is, then there won’t be enough compute to improve the situation unless synthetic data also becomes better per token, and not merely mitigates the data bottleneck and enables further improvement given unbounded compute.
I expect that if we could magically sample much more pre-2014 unique human Go data than was actually generated by actual humans (rather than repeating the limited data we have), from the same platonic source and without changing the level of play, then it would be possible to cheaply tune an LLM trained on it to play superhuman Go.
I don’t know what you mean by ‘general intelligence’ exactly but I suspect you mean something like human+ capability in a broad range of domains. I agree LLMs will become generally intelligent in this sense when scaled, arguably even are, for domains with sufficient data. But that’s kind of the sticking point, right? Cavemen didn’t have the whole internet to learn from, yet somehow did something that not even you seem to claim LLMs will be able to do: create the (data of the) Internet.
(Your last claim seems surprising. Pre-2014 games don’t come close to the Elo of AlphaZero. So a next-token predictor would be trained to simulate a human player up to ~2800 Elo, not 3200+.)
Models can be thought of as repositories of features rather than token predictors. A single human player knows some things, but a sufficiently trained model knows all the things that any of the players know. Appropriately tuned, a model might be able to tap into this collective knowledge to a greater degree than any single human player. Once the features are known, tuning and in-context learning that elicit their use are very sample efficient.
This framing seems crucial for expecting LLMs to reach researcher level of capability given a realistic amount of data, since most humans are not researchers, and don’t all specialize in the same problem. The things researcher LLMs would need to succeed in learning are cognitive skills, so that in-context performance gets very good at responding to novel engineering and research agendas only seen in-context (or a certain easier feat that I won’t explicitly elaborate on).
Possibly the explanation for the Sapient Paradox, that prehistoric humans managed to spend on the order of 100,000 years without developing civilization, is that they lacked cultural knowledge of crucial general cognitive skills. The sample efficiency of the brain enabled the fixation of these skills in language across cultures and generations once they were eventually distilled, but distilling them took quite a lot of time.
Modern humans and LLMs start with all these skills already available in the data, though humans can more easily learn them. LLMs tuned to tap into all of these skills at the same time might be able to go a long way without an urgent need to distill new ones, merely iterating on novel engineering and scientific challenges, applying the same general cognitive skills over and over.
When I brought up sample inefficiency, I was supporting Mr. Helm-Burger’s statement that “there’s huge algorithmic gains in …training efficiency (less data, less compute) … waiting to be discovered”. You’re right of course that a reduction in training data will not necessarily reduce the amount of computation needed. But once again, that’s the way to bet.
I’m ambivalent on this. If the analogy between improvement of sample efficiency and generation of synthetic data holds, synthetic data seems reasonably likely to be less valuable than real data (per token). In that case we’d be using all the real data we have anyway, which with repetition is sufficient for up to about $100 billion training runs (we are at $100 million right now). Without autonomous agency (not necessarily at researcher level) before that point, there won’t be investment to go over that scale until much later, when hardware improves and the cost goes down.
My answer to that is currently in the form of a detailed 2 hour lecture with a bibliography that has dozens of academic papers in it, which I only present to people that I’m quite confident aren’t going to spread the details. It’s a hard thing to discuss in detail without sharing capabilities thoughts. If I don’t give details or cite sources, then… it’s just, like, my opinion, man. So my unsupported opinion is all I have to offer publicly. If you’d like to bet on it, I’m open to showing my confidence in my opinion by betting that the world turns out how I expect it to.
The story involves phase changes. Just scaling is what’s likely to be available to human developers in the short term (a few years), but it’s not enough for superintelligence. Autonomous agency secures funding for a bit more scaling. If this proves sufficient to get smart autonomous chatbots, they then provide the speed to very quickly reach the more elusive AI research needed for superintelligence.
It’s not a little speed, it’s a lot of speed: a serial speedup of about 100x, plus running in parallel. This is not as visible today, because current chatbots are not capable of doing useful work with serial depth, so the serial speedup is not in practice distinct from throughput and cost. But with actually useful chatbots it turns decades into years; software and theory from the distant future become quickly available, and non-software projects get to be designed in perfect detail faster than they can be assembled.
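The arithmetic behind “decades into years” is simple. A sketch with assumed numbers (the 100x figure is from the comment; the project length is an illustrative assumption):

```python
# Assumed numbers for illustration: a ~100x serial speedup over a human
# researcher, applied to a decades-long serial research program.
serial_speedup = 100
human_project_years = 30  # "decades" of serial work
accelerated_years = human_project_years / serial_speedup
print(f"{human_project_years} serial human-years -> {accelerated_years:.1f} years")
# → 0.3 years; even several such programs chained end-to-end finish within
# a few years, with parallel copies handling breadth rather than depth.
```

The reason this is distinct from throughput is that serial depth can’t be bought with more parallel copies, only with speed.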
In my mainline model there are only a few innovations needed, perhaps only a single big one, to produce an AGI which, just as the Turing machine sits at the top of the Chomsky hierarchy, will be basically the optimal architecture given resource constraints. There are probably some minor improvements to do with bridging the gap between the theoretically optimal architecture and the actual architecture, or parts of the algorithm that can be indefinitely improved but with diminishing returns (these probably exist due to Levin, and matrix multiplication is possibly one of them). On the whole I expect AI research to be very chunky.
Indeed, we’ve seen that there was really just one big idea behind all current AI progress: scaling, specifically scaling GPUs on maximally large undifferentiated datasets. There were some minor technical innovations needed to pull this off, but on the whole that was the clincher.
Of course, I don’t know. Nobody knows. But I find this the most plausible guess based on what we know about intelligence, learning, theoretical computer science and science in general.
(Re: Difficult to Parse react on the other comment
I was confused about relevance of your comment above on chunky innovations, and it seems to be making some point (for which what it actually says is an argument), but I can’t figure out what it is. One clue was that it seems like you might be talking about innovations needed for superintelligence, while I was previously talking about possible absence of need for further innovations to reach autonomous researcher chatbots, an easier target. So I replied with formulating this distinction and some thoughts on the impact and conditions for reaching innovations of both kinds. Possibly the relevance of this was confusing in turn.)
There are two kinds of relevant hypothetical innovations: those that enable chatbot-led autonomous research, and those that enable superintelligence. It’s plausible that there is no need for (more of) the former, so that mere scaling through human efforts will lead to such chatbots in a few years regardless. (I think it’s essentially inevitable that there is currently enough compute that with appropriate innovations we can get such autonomous human-scale-genius chatbots, but it’s unclear if these innovations are necessary or easy to discover.) If autonomous chatbots are still anything like current LLMs, they are very fast compared to humans, so they quickly discover remaining major innovations of both kinds.
In principle, even if innovations that enable superintelligence (at scale feasible with human efforts in a few years) don’t exist at all, extremely fast autonomous research and engineering still lead to superintelligence, because they greatly accelerate scaling. Physical infrastructure might start scaling really fast using pathways like macroscopic biotech, even if Drexlerian nanotech is too hard without superintelligence or impossible in principle. Drosophila biomass doubles every 2 days; small things can assemble into large things.
Wasn’t the surprising thing about GPT-4 that scaling laws did hold? Before this many people expected scaling laws to stop before such a high level of capabilities. It doesn’t seem that crazy to think that a few more OOMs could be enough for greater than human intelligence. I’m not sure that many people predicted that we would have much faster than scaling law progress (at least until ~human intelligence AI can speed up research)? I think scaling laws are the extreme rate of progress which many people with short timelines worry about.
To some degree yes, they were not guaranteed to hold. But by that point they had held for over 10 OOMs, iirc, and there was no known reason they couldn’t continue.
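The “no known reason they couldn’t continue” point is easy to see in a toy power-law fit. The constants below are invented for illustration, not the actual published Kaplan or Chinchilla coefficients:

```python
def loss(compute, a=10.0, alpha=0.05, irreducible=1.7):
    """Toy scaling curve L(C) = irreducible + a * C^(-alpha): a straight
    line on a log-log plot, so extrapolating another OOM is unremarkable."""
    return irreducible + a * compute ** (-alpha)

# Six OOMs of training compute: the predicted loss falls smoothly, with no
# feature in the curve that would mark a level where it should break down.
for c in [1e18, 1e20, 1e22, 1e24]:
    print(f"compute {c:.0e}: predicted loss {loss(c):.3f}")
```

On this view, GPT-4 landing on the extrapolated line is the default outcome, not a surprise in either direction.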
This might be the particular twitter bubble I was in but people definitely predicted capabilities beyond simple extrapolation of scaling laws.
Can you expand on what you mean by “create nanotech?” If improvements to our current photolithography techniques count, I would not be surprised if (scaffolded) LLMs could be useful for that. Likewise for getting bacteria to express polypeptide catalysts for useful reactions, and even maybe figure out how to chain several novel catalysts together to produce something useful (again, referring to scaffolded LLMs with access to tools).
If you mean that LLMs won’t be able to bootstrap from our current “nanotech only exists in biological systems and chip fabs” world to Drexler-style nanofactories, I agree with that, but I expect things will get crazy enough that I can’t predict them long before nanofactories are a thing (if they ever are).
Likewise, I don’t think LLMs can immediately obsolete all of the parts of my job. But they sure do make parts of my job a lot easier. If you have 100 workers that each spend 90% of their time on one specific task, and you automate that task, that’s approximately as useful as fully automating the jobs of 90 workers. “Human-equivalent” is one of those really leaky abstractions—I would be pretty surprised if the world had any significant resemblance to the world of today by the time robotic systems approached the dexterity and sensitivity of human hands for all of the tasks we use our hands for, whereas for the task of “lift heavy stuff” or “go really fast” machines left us in the dust long ago.
Iterative improvements on the timescale we’re likely to see are still likely to be pretty crazy by historical standards. But yeah, if your timelines were “end of the world by 2026” I can see why they’d be lengthening now.
My timelines were not 2026. In fact, I made bets against doomers 2-3 years ago, one will resolve by next year.
I agree iterative improvements are significant. This falls under “naive extrapolation of scaling laws”.
By nanotech I mean something akin to Drexlerian nanotech, or something similarly transformative in the vicinity. I think it is plausible that a true ASI would be able to make rapid progress on nanotech (perhaps on the order of a few years or a decade). I suspect that people who don’t take this as a serious possibility haven’t really thought through what AGI/ASI means, plus what the limits and drivers of science and tech really are; I suspect they are simply falling prey to status-quo bias.
Lengthening from what to what?
I’ve never done explicit timelines estimates before so nothing to compare to. But since it’s a gut feeling anyway, I’m saying my gut is lengthening.
Agreed. I’m also pleasantly surprised that your take isn’t heavily downvoted.
Links to Dan Murfet’s AXRP interview:
Transcript
Video
I don’t recall what I said in the interview about your beliefs, but what I meant to say was something like what you just said in this post, apologies for missing the mark.
State-of-the-art models such as Gemini aren’t LLMs anymore. They are natively multimodal or omni-modal transformer models that can process text, images, speech and video. These models seem to me like a huge jump in capabilities over text-only LLMs like GPT-3.
Chain-of-thought prompting makes models much more capable. In the original paper “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”, PaLM 540B with standard prompting only solves 18% of problems but 57% of problems with chain-of-thought prompting.
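The mechanical difference between the two conditions in that paper is just the prompt. A sketch of the two conditions (the exemplar text here is paraphrased for illustration, not copied from the paper):

```python
question = ("Roger has 5 tennis balls. He buys 2 cans with 3 balls each. "
            "How many tennis balls does he have now?")

# Standard prompting: the model must emit the answer directly.
standard_prompt = f"Q: {question}\nA:"

# Chain-of-thought prompting: a worked exemplar demonstrates intermediate
# reasoning steps, which the model then imitates on the new question.
cot_prompt = (
    "Q: A juggler has 16 balls and gives away half of them. How many are left?\n"
    "A: Half of 16 is 8. The answer is 8.\n\n"
    f"Q: {question}\nA:"
)
```

The paper’s 18% vs. 57% GSM8K gap for PaLM 540B comes entirely from this change in prompt format.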
I expect the use of agent features such as reflection will lead to similar large increases in capabilities as well in the near future.
Those numbers don’t really accord with my experience actually using gpt-4. Generic prompting techniques just don’t help all that much.
I just asked GPT-4 a GSM8K problem and I agree with your point. I think what’s happening is that GPT-4 has been fine-tuned to respond with chain-of-thought reasoning by default, so it’s no longer necessary to explicitly ask it to reason step by step. Though if you ask it to “respond with just a single number” to eliminate the chain-of-thought reasoning, its problem-solving ability is much worse.
Mumble.