Very nice! I think the original paper used a ~20GB model and ~20 CPUs (so 20 GPUs, or some small multiple of that) to train in 7 hours. How much have you shrunk it to make training doable and tolerable on Colab?
Where does the 20GB number come from? I can’t see it in a quick scan of the paper. In general the model itself isn’t that huge; in particular they really shrink the dynamics net down to a single ResNet, and it’s mostly the representation function that holds the parameters.
Despite that, training on Colab isn’t really that tolerable. In 10h of training it gets through about 100k steps, but I think that’s only a small fraction of the training the paper’s implementation does (though I don’t have numbers on how many batches they get through), so it’s very hard to tell whether it’s actually learning, or whether a change has had a positive effect.
Basically I’ve been able to use what I think is the full model size, but the quantity of training is pretty low. I’m not sure whether I’d get better performance by cutting the model size to speed up the training loop, but I suspect not, since that hasn’t been visible in CPU runs (then again, I may also just have some error in my Atari-playing code).
Ah. I was remembering something about 20GB from their GitHub, but it looks like it doesn’t correspond to model size like I thought. (I also forgot about the factor of ~3 difference between model size on disk and GPU usage, but even beyond that...)
Ah cheers, I’d not noticed that; I’ve been trying to avoid looking at it too much. The way I understood it, the DRAM usage corresponded very roughly to n_parameters * batch_size, so by adjusting the batch_size I was able to tune the memory usage easily.
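Roughly this kind of back-of-the-envelope, with placeholder numbers (the fixed and per-sample costs below are assumptions for illustration, not measured values from any run):

```python
# Toy sketch of tuning batch_size against a GPU memory budget.
# All numbers are illustrative placeholders.
fixed_gb = 4.0        # weights + gradients + optimizer state (doesn't scale with batch size)
per_sample_gb = 0.01  # activations etc. per sample in the batch (assumed roughly constant)
budget_gb = 12.0      # e.g. a typical Colab GPU

def estimated_usage(batch_size):
    """Linear memory model: fixed cost plus a per-sample term."""
    return fixed_gb + per_sample_gb * batch_size

# Largest batch size that fits the budget under this linear model.
max_batch = int((budget_gb - fixed_gb) / per_sample_gb)
print(max_batch, estimated_usage(max_batch))
```

In practice the per-sample term gets found empirically (bump the batch size until the allocator complains), which is roughly what that tuning amounts to.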
I’d not heard about the factor of 3; is that some particular trick for minimizing the GPU RAM cost?
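(For what it’s worth, one standard accounting that gives a small multiple like that: during training the GPU holds gradients and optimizer state alongside the weights, while the checkpoint on disk is just the weights. A toy illustration with a made-up parameter count:)

```python
# Rough accounting for GPU-resident size vs. checkpoint size during training.
# The parameter count is made up; the multipliers are standard fp32 bookkeeping.
n_params = 50_000_000   # hypothetical parameter count
bytes_per_param = 4     # fp32

checkpoint_gb = n_params * bytes_per_param / 1e9

# Training also keeps gradients plus optimizer state resident:
#   SGD + momentum: weights + grads + momentum buffer    -> ~3x the checkpoint
#   Adam:           weights + grads + two moment buffers -> ~4x the checkpoint
for optimizer, copies in {"sgd_momentum": 3, "adam": 4}.items():
    print(f"{optimizer}: ~{copies * checkpoint_gb:.1f} GB resident "
          f"vs {checkpoint_gb:.1f} GB on disk (before activations)")
```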