Great post!
Do you mind if I ask you what is the amount of free parameters and training compute of EfficientZero?
I tried scanning the paper but didn’t find them readily available.
In Appendix A.6 they state: “To train an Atari agent for 100k steps, it only needs 4 GPUs to train 7 hours.” I don’t think they provide a summary of the total number of parameters, but scanning the described architecture, it does not look like a lot: almost surely < 1B.
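For the compute side, that wall-clock figure is the only number given, so any FLOP estimate has to come from assumptions about the hardware. A minimal sketch, assuming V100-class GPUs (~15 TFLOP/s FP32 peak) and ~30% utilization, purely to get an order of magnitude:

```python
# Rough training-compute estimate from the "4 GPUs, 7 hours" figure.
# GPU model and utilization are assumptions, not stated in the paper,
# so treat the result as an order-of-magnitude guess only.
n_gpus = 4
hours = 7
peak_flops = 15e12    # ~15 TFLOP/s FP32, V100-class GPU (assumed)
utilization = 0.3     # achieved fraction of peak (assumed)

total_flops = n_gpus * hours * 3600 * peak_flops * utilization
print(f"~{total_flops:.1e} FLOPs")   # ~4.5e17 FLOPs
```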
A single ALE game is just not that complex, so ALE models are never large. (Uber once did an interesting paper quantifying how small the NNs for ALE games could be.) The parameter count will be much closer to 1M than 1B. You can also look at the layer types & sizes in Appendix A: even without calculating anything out, with a few convolution layers, a normal-sized LSTM layer, and not much else, there’s simply no way it’s anywhere near 1B.
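As a minimal sketch of that back-of-envelope arithmetic (the layer counts, channel widths, and LSTM size below are illustrative assumptions, not the exact Appendix A values):

```python
# Back-of-envelope parameter count for an EfficientZero-like network:
# a small stack of conv layers plus one LSTM. All sizes are assumed.

def conv_params(in_ch, out_ch, k=3):
    """Parameters in one k x k conv layer (weights + biases)."""
    return in_ch * out_ch * k * k + out_ch

def lstm_params(input_size, hidden_size):
    """Parameters in a single-layer LSTM (4 gates: input, recurrent, bias)."""
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + hidden_size)

total = 0
total += conv_params(4, 64)            # input conv over stacked frames (assumed 64 channels)
total += 10 * 2 * conv_params(64, 64)  # ~10 residual blocks, 2 convs each (assumed)
total += lstm_params(64 * 6 * 6, 512)  # LSTM over a flattened 6x6 feature map (assumed)

print(f"{total:,} parameters (~{total / 1e6:.1f}M)")   # ~6.5M: millions, not billions
```

Even doubling or quadrupling every size keeps the total in the single-digit or low tens of millions, nowhere near 1B.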
(As is pretty much always the case. Models in DRL are usually small, especially compared to supervised/self-supervised stuff. I’m not sure there’s even a single ‘pure DRL’ model which cracks the 1B scale. The biggest chonkers might be MetaMimic or AlphaZero or AlphaStar, which would be in the low hundreds of millions of parameters? That’s probably why DRL papers are not in the habit of reporting parameter counts, or of scaling them up/down the way you might assume these days. That will have to change as more self-supervised models are used.)