It’s good to see DeepMind addressing ethical/safety aspects of their work. The linked blog post isn’t the only thing DeepMind published about the new model. Here is a very long report about many different aspects of the model. Of particular interest is:
We separately consider a retrieval mechanism searching over the training set for relevant extracts during pre-training (Borgeaud et al., 2021), partially avoiding the need to memorise knowledge into network weights. This approach reached GPT-3-level language model performance with a 7 billion parameter model and over a 10× reduction in training compute.
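For readers unfamiliar with the idea, here is a minimal sketch of what a retrieval-augmented language model does. This is purely illustrative and not DeepMind’s implementation: Retro embeds chunks with a frozen BERT, retrieves nearest neighbours from a roughly 2-trillion-token database, and feeds them into the transformer through cross-attention, whereas this toy version just prepends retrieved chunks to the prompt.

```python
# Toy retrieval-augmented LM context construction (illustrative only; see the
# note above for how Retro actually differs). Requires only numpy.
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy hashing bag-of-words embedding; a real system uses a learned encoder."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# The retrieval database: chunks of training text, embedded once up front.
chunks = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "Photosynthesis converts light energy into chemical energy in plants.",
    "GPT-3 is a 175-billion-parameter autoregressive language model.",
]
chunk_vecs = np.stack([embed(c) for c in chunks])

def retrieve(query: str, k: int = 1) -> list:
    """Return the k most similar chunks (Retro uses approximate kNN at scale)."""
    sims = chunk_vecs @ embed(query)
    return [chunks[i] for i in np.argsort(-sims)[:k]]

def build_context(prompt: str) -> str:
    """Prepend retrieved text so the model can condition on it rather than
    having to memorise the facts in its weights."""
    retrieved = "\n".join(retrieve(prompt))
    return f"Relevant extracts:\n{retrieved}\n\nContinue: {prompt}"

print(build_context("Where is the Eiffel Tower located?"))
```

The relevance to the quoted claim is that factual lookups are offloaded to the database, which is part of why a much smaller model can match the language-modeling performance of a much larger one.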
In More Christiano, Cotra, and Yudkowsky on AI progress, Paul Christiano said:

like I’m surprised if a clever innovation does more good than spending 4x more compute
when discussing different predictions made by his vs. Yudkowsky’s models of AI progress. However, Paul was specifically referring to a “clever innovation” which occurs several years in the future. If DeepMind’s 10x more efficient claim holds up, is that a bigger jump than Paul predicted would be plausible today?
Just to give the full quote:

And I’d reject LSTM → transformer or MoE as an example because the quantitative effect size isn’t that big.
But if something like that made the difference between “this algorithm wasn’t scaling before, and now it’s scaling,” then I’d be surprised.
And the size of jump that surprises me is shrinking over time. So in a few years even getting the equivalent of a factor of 4 jump from some clever innovation would be very surprising to me.
The text you quoted was clarifying what “factor of 4” means in that sentence.
I’m not surprised by “googling relevant terms and then putting the results in context improves language modeling loss and performance on knowledge-loaded tasks.” This looks like basically a great implementation of that idea, along with really solid LM infrastructure in general.
I don’t really even have a firm quantitative prediction of how much this kind of thing will improve the LM loss in the world “in a few years” that the quote describes. Note that the effect of this result on downstream performance is almost certainly (much) less than its effect on LM loss, because for most applications you will already be doing something to get relevant information into the LM context (especially for a task that was anywhere near as knowledge-loaded as the LM task, which is usually pretty light on reasoning).
(ETA: as Veedrac points out, it also looks on a first skim like quite a lot of the difference is due to more effectively memorizing nearly-identical text that appeared in the training set, which is even less helpful for downstream performance. So sticking with “even if this is a 10x gain on LM task according to the formal specification, it’s not nearly such a big deal for downstream tasks.”)
My logic for making predictions about this kind of thing is roughly:
In the next few years LM inference will be using large amounts of compute, with costs likely measured in hundreds of millions per year.
Engineering effort to improve performance on the applications people care about is likely to be in the hundreds of millions or billions.
If low-hanging fruit like this hasn’t been plucked at that point, people are probably correct that it’s not going to be a huge effect relative to the things they are spending money on.
People will probably be in the regime where doubling model quality costs more like $100M than $10M. Sometimes you will get lucky and find a big winner, but almost never so lucky as “clever trick that’s worth 4x.” (Right now I think we are maybe at $10M/doubling, though I don’t really know and wouldn’t want to speak specifically if I did.)
(And it won’t be long after that before doubling model quality for important tasks costs $1B+, e.g. semiconductors are probably in the world where it costs $100B or something to cut costs by half. You’ll still sometimes have crazy great startups that spend $100M and manage to get 4x, which grow to huge valuations quickly, but it predictably gets rarer and rarer.)
I agree with you on tasks where there is not a lot of headroom. But on tasks like International Olympiad-level mathematics and programming, a 4x reduction in model size at constant performance will be small. I expect many 1000x and bigger improvements vs. what current scaling laws would predict.
For example, on the MATH dataset, “(...) models would need around 10^35 parameters to achieve 40% accuracy”, where 40% accuracy is roughly what a PhD student achieves and an International Olympiad participant would get close to 90%. https://arxiv.org/abs/2103.03874
With 100-trillion-parameter models (10^14 parameters) we would still be short by a factor of about 10^21. So we will need to get some 21 orders of magnitude of improvement in effective model size for the same performance from somewhere else.
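For concreteness, the arithmetic behind that shortfall, taking the paper’s extrapolated figure quoted above at face value:

```python
import math

# Figures as quoted above: ~1e35 parameters extrapolated for 40% accuracy on MATH,
# with a 100-trillion-parameter (1e14) model as the hypothetical comparison point.
needed_params = 1e35
plausible_params = 1e14

gap = needed_params / plausible_params
print(f"shortfall: factor of {gap:.0e}, i.e. about {math.log10(gap):.0f} orders of magnitude")
# -> shortfall: factor of 1e+21, i.e. about 21 orders of magnitude
```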
Worth noticing the 40% vs. 90% gap between a PhD student and an Olympiad participant on MATH, and the similar gap on MMLU (Massive Multitask Language Understanding): about 35% for an average human vs. 90% for experts. Experts don’t have orders-of-magnitude bigger brains, or a different architecture or learning algorithm in their brains.
When replying, I also noticed that I made assumptions about what you mean by an x-factor quality improvement. I’m not sure I understood correctly. Could you clarify what you meant, precisely?
If you have big communities working on math, I don’t think you will see improvements like 1000x model size (the bigger the community, the harder it will be to get any fixed size of advantage). And I think you will have big communities working on the problem well before it becomes a big deal economically (the bigger the economic deal, the bigger the community). Both of those are quantitative and imperfect and uncertain, but I think they are pretty important rules of thumb for making sense of what happens in the world.
Regarding the IMO disagreement, I think it’s very plausible the IMO will be solved before there is a giant community. So that’s more of a claim that even now, with not many people working on it, you probably aren’t going to get progress that fast. I don’t feel like this speaks to either of the two main disagreements with Eliezer, but it does speak to something like “How often do we see jumps that look big to Paul?” where I’m claiming that I have a better sense for what improvements are “surprisingly big.”
Also worth noting is that the model was trained in December 2020, a year ago. I don’t know when GPT-3 was trained, but if the time-gap between the two is small, that sure looks like a substantial discontinuity in training efficiency. (Though I’d prefer to see long-run data).
If two people trained language models at the same time and one was better than the other, would you call it infinitely fast progress?
I’m confused what you’re asking.
The observation that two SOTA language models trained close together in time were substantially different in measured performance provides evidence of a discontinuity, as defined in the usual sense of a large residual from prior extrapolation.
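To make “a large residual from prior extrapolation” concrete, here is a toy version of that calculation with made-up numbers (not data from any of the papers discussed): fit a trend to past results, extrapolate to the new model’s date, and see how far the new result lands from the prediction.

```python
# Toy "residual from prior extrapolation" computation with invented numbers.
import numpy as np

years = np.array([2017.0, 2018.0, 2019.0, 2020.4])  # dates of past SOTA results
losses = np.array([1.20, 1.05, 0.92, 0.81])          # e.g. bits-per-byte on some benchmark

# Fit a linear trend to log-loss over time and extrapolate to the new model's date.
slope, intercept = np.polyfit(years, np.log(losses), 1)
new_year, new_loss = 2020.9, 0.63                     # hypothetical new result
predicted = np.exp(slope * new_year + intercept)
residual = np.log(predicted) - np.log(new_loss)       # positive = better than the trend predicts

print(f"trend predicts {predicted:.2f}, observed {new_loss:.2f}, log-residual {residual:.2f}")
```

A “discontinuity” in this sense is a residual that is large relative to the scatter of past points around the trend, which is why comparing against the extrapolated trend matters more than comparing against any single previous model.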
I can answer your question literally: I don’t think that would be infinitely fast progress. I am genuinely unsure what your point is though. :)
I think there’s a significant point here: that it only makes sense to compare with the expected trend rather than with one data point. In particular, note that if Gopher had been released one day before GPT-3, then GPT-3 wouldn’t have been SOTA, and the time-to-achieve-x-progress would look a lot longer.
(FWIW, it still seems like a discontinuity to me.)
GPT-3 appeared on arXiv in May 2020: https://arxiv.org/abs/2005.14165
Though I don’t know exactly when it was trained.
It was trained with internet data from October 2019. So it must have been trained between October 2019 and May 2020.
Skimming the Rᴇᴛʀᴏ paper is weird because it looks like there’s leakage everywhere, they admit leakage is everywhere, but then they sort of report results like it doesn’t matter, even putting a result on their leakiest dataset in their conclusion?
On Wikitext103 and the Pile, Retro outperforms previous models trained on large scale datasets.
It looks to me like Figure 6 is saying the improvement is fairly modest in unleaky datasets?
Maybe someone who has gone over the paper in detail can chime in with thoughts.
Could you explain a little more about what you mean by data leakage? Do you mean that complete copies of the text sampled for the evaluation set exist in the training set? Is this one of those things where curating a good dataset is a surprising amount of the work of ML, and so a lot of people haven’t done it?
Edit: Oh. I have now looked at the Retro paper. I’d still be interested in hearing your take on what makes different datasets leaky.
Yes, exact or near-exact copies of the data existing in the database. One can also easily imagine cases where, for example, Wikitext103 has exact copies removed from the dataset but exact translations remain, or where quotes from a Wikipedia article are interspersed throughout the internet, or some bot-generated website exposes mangled data in a form the model has figured out how to deconstruct.
In general, models will exploit leakage when available. Even non-retrieval models seem to memorize snippets of text fairly effectively, even though that seems like a somewhat difficult task for them architecturally. Datasets which amount to “basically the internet” will have pretty much all the leakage, and the paper all but proves their deduplication was not adequate. I do expect that it is difficult to curate a good dataset for evaluating a model like this.
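As a rough sketch of the kind of check being discussed, here is one simple way to quantify leakage between an evaluation document and the training/retrieval data. The Retro paper uses its own chunk-overlap measure, so treat the n-gram procedure below as illustrative rather than a reproduction of theirs.

```python
# Crude leakage estimate: what fraction of an eval document's n-grams already
# appear somewhere in the training/retrieval data?

def ngrams(text: str, n: int = 8) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(eval_doc: str, train_docs: list, n: int = 8) -> float:
    """Fraction of the eval document's n-grams that occur in the training data."""
    eval_grams = ngrams(eval_doc, n)
    if not eval_grams:
        return 0.0
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    return len(eval_grams & train_grams) / len(eval_grams)

# A document with high overlap is "leaky": a retrieval model can copy it nearly
# verbatim, which improves LM loss without implying better downstream performance.
train = ["the quick brown fox jumps over the lazy dog near the river bank today"]
print(overlap_fraction("the quick brown fox jumps over the lazy dog near the river", train))
# -> 1.0 (fully leaked)
```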
However, he was specifically referring to a clever innovation which occurs several years in the future. If DeepMind’s 10x more efficient claim holds up, is that a bigger jump than Paul Christiano predicted would be plausible today?
I’d also be interested in hearing Paul/Eliezer’s takes on RETRO, though I don’t think they reported compute scaling curves in their paper?
Headline quote:
With a 2 trillion token database, our Retrieval-Enhanced Transformer (Retro) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25× fewer parameters.
Relevant figures for performance are Figure 1 and Figure 3.