Michael Dennis tells me that population-based training typically sees strong diminishing returns to population size, such that he doubts that there were more than one or two dozen agents in each population/generation. This is consistent with AlphaStar, I believe, where the population was roughly that size, IIRC...
Anyhow, suppose 30 agents per generation. Then that’s a cost of $5,000/mo x 1.3 months x 30 agents = $195,000 to train the fifth generation of agents. The previous two generations were probably quicker and cheaper. In total, therefore, the price is probably something like half a million dollars of compute?
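For concreteness, here's that back-of-envelope arithmetic as a tiny Python snippet; every input is my guess from above (per-agent rate, generation length, population size, and the multiplier for the earlier generations), not a figure from the paper:

```python
# Back-of-envelope cost estimate; all inputs are guesses, not paper figures.
cost_per_agent_month = 5_000      # assumed $/agent-month of TPU time
months_per_generation = 1.3       # assumed length of the final generation
agents_per_generation = 30        # assumed population size

gen5_cost = cost_per_agent_month * months_per_generation * agents_per_generation
print(f"Generation 5: ${gen5_cost:,.0f}")        # Generation 5: $195,000

# Earlier generations were shorter/cheaper; calling the whole run ~2.5x the
# final generation gives the "half a million dollars" ballpark.
print(f"Rough total:  ${2.5 * gen5_cost:,.0f}")  # Rough total:  $487,500
```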
This seems surprisingly low to me. About one order of magnitude less than I expected. What’s going on? Maybe it really was that cheap. If so, why? Has the price dropped since AlphaStar? Probably… It’s also possible this just used less compute than AlphaStar did...
Michael Dennis tells me that population-based training typically sees strong diminishing returns to population size, such that he doubts that there were more than one or two dozen agents in each population/generation.
Makes sense, given the spinning-top topology of games. These tasks are probably not complex enough to need many distinct agents/populations to traverse the wide part; once you reach the top, you need little diversity to converge on value-equivalent models.
Has the price dropped since AlphaStar?
One observation: you can’t run SC2 environments on a TPU, whereas when you can pack the environment and agents together onto a TPU and batch everything with no copying, you use the hardware much closer to its full potential; see the Podracer numbers.
Only Anakin actually runs the environment on the TPU, and this only works for pretty simple environments (basically: can you implement it in JAX?). Sebulba runs environments on the host, which is what would have been done for this paper too (no idea if they used Sebulba or a different setup).
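To make the Anakin point concrete, here is a hedged toy sketch (mine, not code from the paper or the Podracer release): if the environment step can be written as a pure JAX function, then the entire rollout, environment and agent together, can be jitted and kept on the TPU with no host/device copying inside the loop.

```python
import jax
import jax.numpy as jnp

def env_step(state, action):
    # Toy environment implemented directly in JAX (the Anakin requirement):
    # a point the agent nudges around, rewarded for staying near the origin.
    next_state = state + 0.1 * action
    reward = -jnp.sum(next_state ** 2)
    return next_state, reward

def policy(params, state):
    # Tiny linear policy, purely illustrative.
    return jnp.tanh(params @ state)

@jax.jit
def rollout(params, init_state):
    # Environment and agent fused into one on-device computation.
    def step(state, _):
        action = policy(params, state)
        next_state, reward = env_step(state, action)
        return next_state, reward
    final_state, rewards = jax.lax.scan(step, init_state, None, length=128)
    return final_state, rewards.sum()

final_state, episode_return = rollout(jnp.zeros((2, 2)), jnp.ones(2))
```

Something like SC2 or a Unity simulator can't be expressed as a pure JAX env_step like this, which is exactly why Sebulba keeps those environments on the host.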
This doesn’t really matter though, because for these simulated environments it’s fairly simple to fully utilize the TPUs by running more (remote) environments in parallel.
Yes, I see that they used Unity, so the TPUs themselves couldn’t run the env, but the TPU CPU VM* could potentially run a lot of copies (with the ~300GB of RAM it has access to), and that’d be a lot nicer than running remote VMs. At least in Tensorfork, when we try to use TPU pods, a lot of time goes into figuring out correct use of the interconnect & traffic, because the on-TPU ops are so optimized by default.
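As a hedged illustration of that pattern (toy code, not the paper's setup; ToyHostEnv is a hypothetical stand-in for a Unity environment process): many environment copies live in the host VM's CPUs and RAM, and each step does a single batched policy call on the device.

```python
import concurrent.futures
import jax
import jax.numpy as jnp
import numpy as np

NUM_ENVS = 256  # limited by host CPU/RAM (that ~300GB VM), not by the TPU

class ToyHostEnv:
    """Hypothetical stand-in for a host-side (e.g. Unity) environment."""
    def __init__(self, seed):
        self.rng = np.random.default_rng(seed)
        self.obs = self.rng.normal(size=8).astype(np.float32)

    def step(self, action):
        self.obs = (self.obs + 0.01 * action).astype(np.float32)
        return self.obs

@jax.jit
def act(params, obs_batch):
    # One batched forward pass on the TPU for every host environment.
    return jnp.tanh(obs_batch @ params)

envs = [ToyHostEnv(i) for i in range(NUM_ENVS)]
pool = concurrent.futures.ThreadPoolExecutor(max_workers=32)
params = jnp.zeros((8, 8))
obs = np.stack([e.obs for e in envs])

for _ in range(10):
    actions = np.asarray(act(params, obs))  # one device-to-host copy per step
    obs = np.stack(list(pool.map(lambda ea: ea[0].step(ea[1]), zip(envs, actions))))
```

The real Sebulba-style setups overlap acting with learning and shard environments across host cores, but the basic shape is the same: host CPUs step many environment copies, and the device does one big batched call.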
(And regardless of which of those tricks this open-ended paper uses, this is a point well worth knowing: research can potentially get far more performance out of a TPU pod than one would expect from the TPU usage of older work like AlphaStar.)
* advertisement: access to the VM was recently unlocked for non-Google TPU users. It really changes how you treat TPU use!