I know of a few EAs who thought that natural language pre-training would continue to provide meaningful performance gains for coding as training scales up over the next few years, and I see this as strong evidence against that claim.
I think that was largely settled by the earlier work on transfer scaling laws and Bayesian hierarchical interpretations: pretraining provides an informative prior which increases sample-efficiency on a related task, providing in essence a fixed-n sample gain. But enough data washes out the prior, whether informative or uniform. So if you have enough data and compute (which you usually don't), transfer results in the same final performance. This is also true in, say, image classification: stuff like CLIP is great for transfer learning—unless you have millions of labeled images in your target domain, in which case, yeah sure, it'll probably just match the from-scratch model. (How else could things possibly work?) 715GB of Github is definitely large enough that it washes out the natural language prior! But as the Copilot paper also points out, you still get a big benefit in that you can train a lot less when you start with GPT-3 as the prior. Nothing to sneeze at, and I suspect DM would've gotten even better results here if they had started from Gopher rather than training from scratch on their own dataset, as their compute budget would stretch much further and they could do much less brute-force rejection sampling of program candidates.
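To make the "informative prior ≈ fixed-n sample gain" intuition concrete, here is a minimal toy sketch (all numbers made up purely for illustration, nothing from the AlphaCode or transfer-scaling papers): in a Beta-Bernoulli model, an informative prior behaves like a fixed count of pseudo-observations, so its effect on the posterior shrinks toward zero as the real target-task data grows, which is the same washing-out dynamic described above.

```python
import numpy as np

# Toy illustration (made-up numbers): an informative Beta prior acts like a
# fixed count of pseudo-observations, so as real task data accumulates, the
# informative-prior and uniform-prior posteriors converge -- the prior is
# "washed out" and the final estimate is the same either way.
rng = np.random.default_rng(0)
true_p = 0.7               # the "target task" parameter we are estimating
informative = (14, 6)      # Beta prior worth ~20 pseudo-samples, centered near 0.7
uniform = (1, 1)           # uninformative Beta(1, 1) prior

for n in (10, 100, 10_000, 1_000_000):
    data = rng.binomial(1, true_p, size=n)
    k = int(data.sum())
    post_inf = (informative[0] + k) / (sum(informative) + n)  # posterior mean, informative prior
    post_uni = (uniform[0] + k) / (sum(uniform) + n)          # posterior mean, uniform prior
    print(f"n={n:>9,}  informative={post_inf:.4f}  uniform={post_uni:.4f}  gap={abs(post_inf - post_uni):.5f}")
```

The gap at small n is the fixed sample gain; by n = 1,000,000 it is negligible, which is the analogue of 715GB of Github swamping the natural-language prior.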
It’s also direction-specific. Natural language may wash out when you target Github… but does Github wash out when you target natural language? I will be curious to see if anyone tries using the coding models as the prior for language modeling instead of vice-versa, and if that leads to noticeable gains on the various reasoning-esque benchmarks.
Table 7 shows that pre-training on a natural language corpus slightly degrades performance compared to pre-training on Github.
But it does not show that pretraining on natural language plus Github is worse than Github-only. This is also what you'd expect given the Copilot result that GPT-3 (natural language) initialization trains much faster to the same performance level.
I agree that the scaling laws for transfer paper already strongly suggested that pre-training would eventually not provide much in the way of performance gain. I remember doing a back-of-the-envelope calculation of whether models in 2025 would still use pre-training (and finding it wouldn't improve final performance), but I certainly didn't expect us to reach this point in early 2022. I also had some small but significant uncertainty about how well the scaling-laws result would hold up when switching dataset, model, and model size, so the AlphaCode data point is useful in that regard as well.
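For the shape of that back-of-the-envelope: the Scaling Laws for Transfer paper models the benefit of pre-training as "effective data transferred", D_T = k · D_F^α · N^β, where D_F is the fine-tuning dataset size and N is the parameter count. The sketch below uses placeholder coefficients chosen only for illustration (not the paper's fitted values) to show the qualitative point: because the exponent on D_F is well below 1, the relative benefit D_T / D_F collapses once the fine-tuning corpus gets large.

```python
# Back-of-the-envelope sketch with placeholder constants (k, alpha, beta, N are
# illustrative assumptions, not fitted values from the paper). "Effective data
# transferred" D_T = k * D_F**alpha * N**beta is what pre-training is "worth"
# in extra fine-tuning tokens; since alpha < 1, D_T / D_F shrinks as D_F grows.
k, alpha, beta = 1e4, 0.2, 0.4      # placeholder fit constants
N = 40e9                            # assume a ~40B-parameter model

for D_F in (1e6, 1e8, 1e10, 1e12):  # fine-tuning (code) tokens
    D_T = k * D_F**alpha * N**beta
    print(f"D_F={D_F:.0e} tokens -> effective transfer D_T={D_T:.2e} tokens "
          f"({100 * D_T / D_F:,.1f}% of D_F)")
```

With any coefficients of roughly this shape, pre-training looks enormously valuable when fine-tuning data is scarce and nearly irrelevant at Github-scale corpora, which is the qualitative conclusion being pointed at here.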
As for the point on accelerating training, this makes intuitive sense to me, but it's not clear to me how relevant it is: Figure 7 of the Scaling Laws for Transfer paper shows that the compute needed to plateau on their largest models, with and without pre-training, looks to be within an OOM?
An OOM is nothing to sneeze at, especially when you can get it for free by training an off-the-shelf pretrained model (DM already trained Gopher; it doesn't cost any more to reuse it!) exactly as you would otherwise, with no compromises or dead ends like MoEs. Note that AlphaCode didn't have the compute budget to do its approach optimally.
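To spell out the "free OOM" arithmetic with a toy example (every number invented for illustration; nothing here is from the AlphaCode paper): under a fixed project quota, reusing an already-trained model as the initialization means its pre-training cost is sunk rather than charged to the project, and the freed budget can go to the expensive downstream stage such as sampling and filtering candidate programs.

```python
# Toy quota arithmetic (all numbers made up for illustration): if the
# pre-training stage can be reused for free, the same project quota buys far
# more of the downstream work (candidate sampling), even when final model
# quality would be identical either way.
quota           = 100.0   # total project compute, arbitrary units
pretrain_cost   = 40.0    # cost of pre-training a from-scratch LM of this size
finetune_cost   = 20.0    # cost of fine-tuning on code, assumed the same either way
cost_per_sample = 0.001   # cost to generate and filter one candidate program

for label, pretrain in (("train LM from scratch", pretrain_cost),
                        ("reuse existing Gopher (sunk cost)", 0.0)):
    leftover = quota - pretrain - finetune_cost
    print(f"{label:>33}: {leftover:.0f} units left -> ~{leftover / cost_per_sample:,.0f} candidate samples")
```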
Why didn’t they use Gopher then for AlphaCode? Maybe Gopher wasn’t done training yet?
Possibly, but the timelines don't quite seem to line up. On Twitter, DMers are describing this as a 2-year project, implying AlphaCode started ~February 2020. GPT-3 wouldn't come out until May 2020 and obviously Codex/Copilot didn't come out until mid-2021, but there were already Transformers for code generation (even code-assistance ones like TabNine), so this is pretty much the obvious way to keep going and '2 years' is entirely plausible as the timespan. Now, Gopher is described as starting (or was it finishing?) training in December 2020, so it became available about half-way through: they had all of 2021 & January 2022 to drop Gopher into their training & evaluation framework. I know there's always a lot of inertia and everything always takes longer than outsiders predict on projects this complicated (look at the acknowledgements and staff section)… but I think that's probably enough time that they could have used Gopher if they had really wanted to, unless this project was very frontloaded and mostly done by the time Gopher came around, and they spent most of 2021 doing things like writing it up and evaluating it?
It seems equally plausible to me that they had run out of their allotted compute by the time Gopher came around, and that even if it would have been on net more efficient to use Gopher, they had already spent their quota. DM doesn't have an infinite budget and can't pursue everything to the logical endpoint (like how AlphaStar was ignominiously dropped right around when it had added harder APM limits / was using the camera like a human / was training on all races+maps, but was only human-pro-level and hadn't AlphaGo'd humans yet).
Yes, I agree: certainly at 2025 training-run prices, saving 2-5x on a compute run will be done whenever possible. For this reason, I'd like to see more predictions on my Metaculus question!