And I’d reject LSTM → transformer or MoE as an example because the quantitative effect size isn’t that big.
But if something like that made the difference between “this algorithm wasn’t scaling before, and now it’s scaling,” then I’d be surprised.
And the size of jump that surprises me is shrinking over time. So in a few years even getting the equivalent of a factor of 4 jump from some clever innovation would be very surprising to me.
The text you quoted was clarifying what “factor of 4” means in that sentence.
I’m not surprised by “googling relevant terms and then putting the results in context improves language modeling loss and performance on knowledge-loaded tasks.” This looks like basically a great implementation of that idea, along with really solid LM infrastructure in general.
I don’t really even have a firm quantitative prediction of how much this kind of thing will improve the LM loss in the world “in a few years” that the quote describes. Note that the effect of this result on downstream performance is almost certainly (much) less than its effect on LM loss, because for most applications you will already be doing something to get relevant information into the LM context (especially for a task that was anywhere near as knowledge-loaded as the LM task, which is usually pretty light on reasoning).
(ETA: as Veedrac points out, it also looks on a first skim like quite a lot of the difference is due to more effectively memorizing nearly-identical text that appeared in the training set, which is even less helpful for downstream performance. So sticking with “even if this is a 10x gain on LM task according to the formal specification, it’s not nearly such a big deal for downstream tasks.”)
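As a minimal sketch of what "googling relevant terms and putting the results in context" amounts to (a toy keyword-overlap retriever and a made-up corpus standing in for real search infrastructure, not the system under discussion):

```python
# Toy sketch of retrieval-augmented context: score documents by keyword
# overlap with the query, then prepend the best matches to the LM prompt.
# Corpus, retriever, and prompt format are illustrative stand-ins only.

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k documents sharing the most words with the query."""
    query_words = set(query.lower().split())
    return sorted(corpus,
                  key=lambda doc: -len(query_words & set(doc.lower().split())))[:k]

def build_context(query: str, corpus: list[str]) -> str:
    """Concatenate retrieved passages ahead of the query for the LM to condition on."""
    passages = retrieve(query, corpus)
    return "\n".join(f"[retrieved] {p}" for p in passages) + f"\n[query] {query}"

corpus = [
    "The Eiffel Tower was completed in 1889 for the World's Fair in Paris.",
    "Scaling laws relate language-model loss to parameters, data, and compute.",
    "The capital of Australia is Canberra, not Sydney.",
]
print(build_context("When was the Eiffel Tower completed?", corpus))
```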
My logic for making predictions about this kind of thing is roughly:
In the next few years LM inference will be using large amounts of compute, with costs likely measured in hundreds of millions per year.
Engineering effort to improve performance on the applications people care about is likely to be in the hundreds of millions or billions.
If low-hanging fruit like this hasn’t been plucked at that point, people are probably correct that it’s not going to be a huge effect relative to the things they are spending money on.
People will probably be in the regime where doubling model quality costs more like $100M than $10M. Sometimes you will get lucky and find a big winner, but almost never so lucky as a “clever trick that’s worth 4x” (see the back-of-the-envelope arithmetic sketched after this list). (Right now I think we are maybe at $10M/doubling, though I don’t really know and wouldn’t want to speak specifically if I did.)
(And it won’t be long after that before doubling model quality for important tasks costs $1B+, e.g. semiconductors are probably in the world where it costs $100B or something to cut costs by half. You’ll still sometimes have crazy great startups that spend $100M and manage to get 4x, which grow to huge valuations quickly, but it predictably gets rarer and rarer.)
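To make the equivalence being assumed concrete: treat a multiplicative improvement as log2(improvement) doublings and price each doubling at the illustrative $100M figure above. The numbers are the comment’s hypotheticals, not measurements.

```python
import math

# Illustrative figure from the comment above, not a measurement.
COST_PER_DOUBLING = 100e6  # dollars of R&D per doubling of "model quality"

def equivalent_spend(improvement_factor: float) -> float:
    """Dollars of ordinary R&D a multiplicative improvement is 'worth',
    counting log2(factor) doublings at COST_PER_DOUBLING each."""
    return math.log2(improvement_factor) * COST_PER_DOUBLING

print(f"2x trick ~= ${equivalent_spend(2):,.0f} of R&D")  # ~$100,000,000
print(f"4x trick ~= ${equivalent_spend(4):,.0f} of R&D")  # ~$200,000,000
```

On those numbers, a single trick worth 4x is as valuable as roughly $200M of ordinary engineering, which is why such finds should get rarer as total spending in the field grows.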
I agree with you on tasks where there is not a lot of headroom. But on tasks like International Olympiad-level mathematics and programming, a 4x reduction in model size at constant performance will be small. I expect many improvements of 1000x and more relative to what scaling laws would currently predict.
For example, on the MATH dataset, “(...) models would need around 10^35 parameters to achieve 40% accuracy”, where 40% accuracy is what a PhD student achieves and an International Olympiad participant gets close to 90%. https://arxiv.org/abs/2103.03874
With a 100-trillion-parameter model (10^14 parameters) we would still be short by a factor of 10^21. So we will need some 21 orders of magnitude of improvement in effective model size, at the same performance, from somewhere else.
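Spelling out that arithmetic (the 10^35 figure is the paper’s extrapolation; the 10^14-parameter model is a hypothetical):

```python
import math

params_needed = 1e35    # extrapolated parameters for 40% MATH accuracy (arXiv:2103.03874)
params_feasible = 1e14  # a hypothetical 100-trillion-parameter model

gap = params_needed / params_feasible
print(f"shortfall: {gap:.0e}x, i.e. about {math.log10(gap):.0f} orders of magnitude")
# shortfall: 1e+21x, i.e. about 21 orders of magnitude
```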
It is worth noticing the 40% vs. 90% gap between those humans on MATH, and the similar gap on MMLU (Massive Multitask Language Understanding): about 35% for an average human vs. about 90% for experts. Experts don’t have orders-of-magnitude bigger brains, or a different architecture or learning algorithm in their brains.
While replying, I also noticed that I made assumptions about what you mean by a factor-of-x quality improvement. I’m not sure I understood correctly. Could you clarify what you meant, precisely?
If you have big communities working on math, I don’t think you will see improvements equivalent to 1000x in model size (the bigger the community, the harder it is to get an advance of any fixed size). And I think you will have big communities working on the problem well before it becomes a big deal economically (the bigger the economic deal, the bigger the community). Both of those rules are quantitative, imperfect, and uncertain, but I think they are pretty important rules of thumb for making sense of what happens in the world.
Regarding the IMO disagreement, I think it’s very plausible the IMO will be solved before there is a giant community working on it. So that’s more of a claim that even now, with not many people working on it, you probably aren’t going to get progress that fast. I don’t feel like this speaks to either of the two main disagreements with Eliezer, but it does speak to something like “How often do we see jumps that look big to Paul?”, where I’m claiming that I have a better sense of which improvements are “surprisingly big.”