When we didn’t have enough information to directly count FLOPs, we looked GPU training time and total number of GPUs used and assumed a utilization efficiency (usually 0.33)
We trained the league using three main agents (one for each StarCraft race), three main exploiter agents (one for each race), and six league exploiter agents (two for each race). Each agent was trained using 32 third-generation tensor processing units (TPUs) over 44 days
My calculation for AlphaStar: 12 agents * 44 days * 24 hours/day * 3600 sec/hour * 420*10^12 FLOP/s per TPUv3 board * 32 TPUv3 boards * 33% actual board utilization = 2.02 * 10^23 FLOP, which is about the same as the AlphaGo Zero compute.
For the 600B GShard MoE model: 22 TPU core-years = 22 years * 365 days/year * 24 hours/day * 3600 sec/hour * 420*10^12 FLOP/s/TPUv3 board * 0.25 TPU boards / TPU core * 0.33 actual board utilization = 2.4 * 10^22 FLOP.
For the 2.3B GShard dense transformer: 235.5 TPU core-years = 2.6 * 10^23 FLOP.
Meena was trained for 30 days on a TPUv3 pod with 2048 cores. So it’s 30 days * 24 hours/day * 3600 sec/hour * 2048 TPUv3 cores * 0.25 TPU boards / TPU core * 420*10^12 FLOP/s/TPUv3 board * 33% actual board utilization = 1.8 * 10^23 FLOP, slightly below AlphaGo Zero.
Image GPT: “iGPT-L was trained for roughly 2500 V100-days”. This means 2500 days * 24 hours/day * 3600 sec/hour * 100*10^12 FLOP/s * 33% actual GPU utilization = 7.1 * 10^9 * 10^12 = 7.1 * 10^21 FLOP. There’s no compute data for the largest model, iGPT-XL. But based on the FLOP/s increase from GPT-3 XL (same number of parameters as iGPT-L) to GPT-3 6.7B (same number of parameters as iGPT-XL), I think it required 5 times more compute: 3.6 * 10^22 FLOP.
BigGAN: 2 days * 24 hours/day * 3600 sec/hour * 512 TPU cores * 0.25 TPU boards / TPU core * 420*10^12 FLOP/s/TPUv3 board * 33% actual board utilization = 3 * 10^21 FLOP.
AlphaFold: they say they trained on GPU and not TPU. Assuming V100 GPU, it’s 5 days * 24 hours/day * 3600 sec/hour * 8 V100 GPU * 100*10^12 FLOP/s * 33% actual GPU utilization = 10^20 FLOP.
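The estimates above all follow one pattern: device-time * peak throughput * assumed utilization. A minimal sketch of that arithmetic follows, using only the figures quoted in the comment (420*10^12 FLOP/s per TPUv3 board, 0.25 boards per core, 100*10^12 FLOP/s per V100, 33% utilization). The function name `training_flop` and the constant names are hypothetical, chosen for illustration, and the hardware figures are taken from the comment as-is rather than independently checked.

```python
SECONDS_PER_DAY = 24 * 3600

# Figures as used in the comment (assumptions, not independently verified).
TPU_V3_BOARD_FLOPS = 420e12  # peak FLOP/s per TPUv3 board
V100_FLOPS = 100e12          # peak FLOP/s per V100 GPU
BOARDS_PER_CORE = 0.25       # TPU boards per TPU core, the comment's conversion
UTILIZATION = 0.33           # assumed fraction of peak actually achieved


def training_flop(device_days, peak_flops, utilization=UTILIZATION):
    """Return device-days * seconds per day * peak FLOP/s * utilization."""
    return device_days * SECONDS_PER_DAY * peak_flops * utilization


estimates = {
    # 12 agents * 44 days * 32 boards
    "AlphaStar": training_flop(12 * 44 * 32, TPU_V3_BOARD_FLOPS),
    # core-years converted to board-days
    "GShard 600B MoE": training_flop(22 * 365 * BOARDS_PER_CORE, TPU_V3_BOARD_FLOPS),
    "GShard 2.3B dense": training_flop(235.5 * 365 * BOARDS_PER_CORE, TPU_V3_BOARD_FLOPS),
    # 30 days * 2048 cores
    "Meena": training_flop(30 * 2048 * BOARDS_PER_CORE, TPU_V3_BOARD_FLOPS),
    # 2500 V100-days
    "iGPT-L": training_flop(2500, V100_FLOPS),
    # 2 days * 512 cores
    "BigGAN": training_flop(2 * 512 * BOARDS_PER_CORE, TPU_V3_BOARD_FLOPS),
    # 5 days * 8 GPUs
    "AlphaFold": training_flop(5 * 8, V100_FLOPS),
}

for name, flop in estimates.items():
    print(f"{name:18s} {flop:.2e} FLOP")
```

Changing `UTILIZATION` or the per-device throughput in one place rescales every estimate, which makes it easy to see how sensitive the comparison with AlphaGo Zero is to those assumptions.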
A previous calculation on LW gave 2.4 * 10^24 FLOP for AlphaStar (using values from the original AlphaStar blog post), which suggested that the trend was roughly on track.
The differences between the two calculations are (your values first):
Agents: 12 vs 600
Days: 44 vs 14
TPUs: 32 vs 16
Utilisation: 33% vs 50% (I think this is just estimated in the other calculation)
Do you have a reference for the values you use?
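To see how the four differences listed above combine, here is a rough sketch that plugs both sets of values into the same formula. It assumes both calculations use 420*10^12 FLOP/s per TPU; for the earlier calculation that is an inference from the fact that it reproduces the 2.4 * 10^24 figure, not something stated in the thread. The function name `alphastar_flop` is hypothetical.

```python
SECONDS_PER_DAY = 24 * 3600
PEAK_FLOPS = 420e12  # FLOP/s per TPU; assumed to be shared by both calculations


def alphastar_flop(agents, days, tpus, utilization):
    """Return agents * days * seconds per day * TPUs * peak FLOP/s * utilization."""
    return agents * days * SECONDS_PER_DAY * tpus * PEAK_FLOPS * utilization


newer = alphastar_flop(agents=12, days=44, tpus=32, utilization=0.33)
older = alphastar_flop(agents=600, days=14, tpus=16, utilization=0.50)

print(f"newer estimate: {newer:.2e} FLOP")
print(f"older estimate: {older:.2e} FLOP")
print(f"older / newer:  {older / newer:.1f}x")

# Per-factor contributions to the ratio (older value divided by newer value).
factors = {
    "agents": 600 / 12,
    "days": 14 / 44,
    "TPUs": 16 / 32,
    "utilization": 0.50 / 0.33,
}
for name, ratio in factors.items():
    print(f"{name:12s} {ratio:.2f}x")
```

The agent count dominates: it alone contributes a factor of 50, which the shorter duration and smaller TPU count only partly offset.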
I appreciate questioning of my calculations, thanks for checking!
This is what I think about the previous avturchin calculation: it may have been a misinterpretation of the DeepMind blog post. The blog post says “The AlphaStar league was run for 14 days, using 16 TPUs for each agent”. But I think this might not mean 16 TPUs for the full 14 days for each agent; rather, it is 16 TPUs for 14/n_agent = 14/600 days for each agent, with 14 days being the duration of the whole league training, in which agent policies were trained consecutively. Their wording is indeed not very clear, but you can look at the “Progression of Nash of AlphaStar League” picture. You can see there that, as they say, “New competitors were dynamically added to the league, by branching from existing competitors”, and that the new ones drastically outperform the older ones, meaning that the older ones were not continuously updated and were only randomly picked as static opponents.
From the blog post: “A full technical description of this work is being prepared for publication in a peer-reviewed journal”. The only publication about this is their late-2019 Nature paper, linked by teradimich here, which I have taken the values from. They upgraded their algorithm and spent more compute in a single experiment by October 2019. The 12 agents refers to the number of agent types, and 600 (900 in the newer version) refers to the number of policies. As for the 33% GPU utilization value: I think I’ve seen it in some ML publications and in other places for this hardware, and it seems like a reasonable estimate for all these projects, but I don’t have sources at hand.
Probably that:
This can be useful:
Correction: AlphaStar used 6*10^22 FLOP, not 2*10^23. You have mixed up TPU chips and TPU boards.
What is the GShard dense transformer you are referring to in this post?
It should be referenced here in Figure 1: https://arxiv.org/pdf/2006.16668.pdf