I agree that the 'Scaling Laws for Transfer' paper already strongly suggested that pre-training would eventually not provide much in the way of performance gain. I remember doing a back-of-the-envelope calculation on whether models in 2025 would still use pre-training (and finding it wouldn't improve performance), but I certainly didn't expect us to reach this point in early 2022. I also had some small but significant uncertainty about how well the scaling-laws result would hold up when switching dataset + model + model size, so the AlphaCode data point is useful in that regard as well.
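(To make the back-of-the-envelope concrete: a minimal sketch of the kind of calculation I mean, using the paper's fitted functional form for 'effective data transferred', D_T ≈ k·D_F^α·N^β. The constants and the model/dataset sizes below are illustrative placeholders rather than the paper's exact fits, so treat the output as shape-of-the-curve only.)

```python
# Back-of-the-envelope in the spirit of "Scaling Laws for Transfer" (Hernandez et al. 2021):
# effective data transferred from pre-training is fit as  D_T ~= k * D_F**alpha * N**beta,
# where D_F is the fine-tuning dataset size and N the parameter count.
# The constants below are illustrative placeholders, NOT the paper's exact fitted values.

k, alpha, beta = 2e4, 0.2, 0.4   # placeholder fit constants
n_params = 1e12                  # hypothetical future model size

def effective_data_transferred(d_finetune):
    """Fine-tuning-equivalent tokens that pre-training is 'worth' at this scale."""
    return k * d_finetune**alpha * n_params**beta

# Because alpha < 1, D_T grows sublinearly in D_F: the bigger the fine-tuning corpus,
# the smaller the fraction of it that pre-training effectively contributes.
for d_finetune in (1e10, 1e11, 1e12, 1e13):
    d_t = effective_data_transferred(d_finetune)
    print(f"D_F = {d_finetune:.0e} tokens -> D_T = {d_t:.2e} "
          f"({d_t / d_finetune:.0%} of the fine-tuning set)")
```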
As for the point about accelerating training: this makes intuitive sense to me, but I'm not sure how relevant it is. Figure 7 of Scaling Laws for Transfer shows that the compute needed to plateau on their largest models, with and without pre-training, looks to be within an OOM?
An OOM is nothing to sneeze at, especially when you can get it for free by training an off-the-shelf pretrained model (DM already trained Gopher; it doesn't cost anything more to reuse it!) exactly as you would otherwise, with no compromises or dead ends like MoEs. Note that AlphaCode didn't have the compute budget to do its approach optimally.
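(To put that 'free OOM' in FLOP terms, a rough sketch under stated assumptions: it uses the common C ≈ 6·N·D approximation for dense-Transformer training compute and supposes, purely for illustration, that fine-tuning from a reused checkpoint reaches the same plateau in ~10x fewer tokens than training from scratch. The model size and token counts are hypothetical, not AlphaCode's.)

```python
# Rough FLOP arithmetic for "an OOM for free": reusing an already-trained checkpoint
# means its pre-training compute is sunk cost, so only the fine-tuning run is new spend.
# Uses the common C ~= 6 * N * D approximation (params * tokens) for training compute;
# the model/token numbers below are illustrative, not AlphaCode's actual figures.

def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

n_params        = 40e9   # hypothetical code-model size
tokens_scratch  = 1e12   # tokens to plateau when training from scratch (illustrative)
tokens_finetune = 1e11   # tokens to the same plateau starting from a pretrained
                         # checkpoint, assuming the ~1 OOM saving discussed above

c_scratch  = train_flops(n_params, tokens_scratch)
c_finetune = train_flops(n_params, tokens_finetune)

print(f"from scratch: {c_scratch:.2e} FLOPs")
print(f"fine-tune:    {c_finetune:.2e} FLOPs "
      f"({c_scratch / c_finetune:.0f}x less new compute)")
# The pretrained checkpoint's own training cost was already paid (e.g. a Gopher that
# exists anyway), so from the project's point of view the saving really is ~10x.
```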
Why didn't they use Gopher for AlphaCode, then? Maybe Gopher wasn't done training yet?
Possibly, but the timelines don't quite seem to line up. On Twitter, DMers are describing this as a 2-year project, implying AlphaCode started ~February 2020. GPT-3 wouldn't come out until May 2020 and Codex/Copilot obviously didn't come out until mid-2021, but there were already Transformer models for code generation (even code-assistance ones like TabNine), so this was pretty much the obvious way to keep going and '2 years' is entirely plausible as the timespan. Now, Gopher is described as starting (or was it finishing?) training in December 2020, so it became available about halfway through: they had all of 2021 & January 2022 to drop Gopher into their training & evaluation framework. I know there's always a lot of inertia and everything always takes longer than outsiders predict on projects this complicated (look at the acknowledgements and staff section)… but I think that's probably enough time that they could have used Gopher if they had really wanted to, unless this project was very front-loaded and mostly done by the time Gopher came around, and they spent most of 2021 doing things like writing it up or evaluating it?
It seems equally plausible to me that they had run out of their allotted compute by the time Gopher came around, and that even if it would have been more efficient on net to train from Gopher, they had already spent their quota. DM doesn't have an infinite budget and can't pursue everything to the logical endpoint (like how AlphaStar was ignominiously dropped right around when it had added harder APM limits / was using the camera like a human / was training on all races+maps, but was only human-pro-level and hadn't AlphaGo'd humans yet).
Yes, I agree: certainly at 2025 training-run prices, a 2-5x saving on a compute run will be taken whenever possible. For that reason, I'd like to see more predictions on my Metaculus question!