Julian Schrittwieser comments on Parameter counts in Machine Learning

Julian Schrittwieser 31 Aug 2021 20:31 UTC
4 points
The difference in compute between AlexNet and AlphaZero arises because for AlexNet you are counting only the FLOPs spent during training, while for AlphaZero you are counting both the training and the self-play data generation (which takes 800 forward passes per move * ~200 moves to generate each game).
If you were to compare supervised training numbers for both (e.g. training on human chess or Go games), the two would be much closer.
- Rohin Shah 1 Sep 2021 13:07 UTC
  3 points
  That’s fair. I was thinking of that as part of “compute needed during training”, but you could also split it up into “compute needed for gradient updates” and “compute needed to create data of sufficient quality”, and then say that the stable thing is the “compute needed for gradient updates”.
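To make the split discussed above concrete, here is a rough back-of-the-envelope sketch in Python. Only the 800 forward passes per move and the ~200 moves per game come from the thread. The function names, the example inputs, and the rule of thumb that one training sample costs about three forward passes (forward plus backward) are assumptions for illustration, not figures from AlphaZero itself.

```python
# Figures taken from the comment above.
SIMULATIONS_PER_MOVE = 800  # forward passes per move during self-play
MOVES_PER_GAME = 200        # approximate number of moves per game


def forward_passes_per_game(sims_per_move=SIMULATIONS_PER_MOVE,
                            moves_per_game=MOVES_PER_GAME):
    """Forward passes needed to generate one self-play game."""
    return sims_per_move * moves_per_game


def compute_split(flops_per_forward, num_games, num_training_samples,
                  train_cost_multiplier=3.0):
    """Split total compute into data generation and gradient updates.

    Assumption (not from the thread): processing one training sample
    costs roughly `train_cost_multiplier` forward passes.
    """
    self_play = forward_passes_per_game() * num_games * flops_per_forward
    gradient = num_training_samples * train_cost_multiplier * flops_per_forward
    return {
        "self_play_flops": self_play,
        "gradient_flops": gradient,
        "self_play_fraction": self_play / (self_play + gradient),
    }


if __name__ == "__main__":
    # Hypothetical inputs: 1,000 games, each position trained on once,
    # and a unit cost of 1 FLOP per forward pass so the ratio is easy to read.
    split = compute_split(flops_per_forward=1.0,
                          num_games=1_000,
                          num_training_samples=MOVES_PER_GAME * 1_000)
    print(forward_passes_per_game())
    print(split)
```

Under these made-up inputs, almost all of the counted compute goes to creating data rather than to gradient updates, which is the asymmetry the thread describes. The exact fraction depends heavily on how many times each position is reused in training, so treat the numbers as illustrative only.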