Where does the 20GB number come from? I can’t see it in a quick scan of the paper. In general, the model itself isn’t that huge; in particular, they really shrink the dynamics net down to a single ResNet, so it’s mostly the representation function that holds the parameters.
Despite that, training on Colab isn’t really tolerable. In 10 hours of training it gets through about 100k steps, which I think is only a small fraction of the training done in the paper’s implementation (though I don’t have numbers on how many batches they get through), so it’s very hard to work out whether it’s learning at all or whether a change has had a positive effect.
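For a rough sense of the rate those numbers imply (just back-of-the-envelope from the approximate figures above):

```python
# Rough throughput implied by the numbers above: ~100k training steps in ~10h.
steps, hours = 100_000, 10
print(f"{steps / (hours * 3600):.1f} steps/s")  # roughly 2.8 training steps per second
```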
Basically I’ve been able to use what I think is the full model size, but the quantity of training is pretty low. I’m not sure whether I’d get better performance by cutting the model size to speed up the training loop, but I suspect not, since that hasn’t been visible in CPU runs (though I may also just have some error in my Atari-playing code).
Ah. I was remembering something about 20GB from their GitHub, but it looks like it doesn’t correspond to model size like I thought. (I also forgot about the factor of ~3 difference between model size on disk and GPU usage, but even beyond that...)
Ah cheers, I’d not noticed that; I’ve been trying to avoid looking at it too much. The way I understood it, the DRAM usage corresponds very roughly to n_parameters * batch_size, and by adjusting the batch_size I was able to tune the memory usage easily.
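The tuning itself was just linear extrapolation. Here’s a minimal sketch of that step, assuming memory usage really does scale roughly linearly with batch_size at a fixed parameter count; the GB figures are made-up placeholders, not actual measurements from my runs:

```python
# Minimal sketch of the batch-size tuning described above, assuming memory
# usage scales roughly linearly with batch_size for a fixed parameter count.
# The concrete GB numbers are illustrative placeholders, not real measurements.

def largest_fitting_batch_size(measured_gb: float, measured_batch_size: int,
                               budget_gb: float) -> int:
    """Estimate per-sample memory from one trial run, then pick the largest
    batch size that the rough linear model says will fit in the budget."""
    per_sample_gb = measured_gb / measured_batch_size
    return max(1, int(budget_gb / per_sample_gb))

# e.g. a trial run at batch_size=128 observed to use ~8 GB, with ~12 GB available
print(largest_fitting_batch_size(measured_gb=8.0, measured_batch_size=128,
                                 budget_gb=12.0))  # -> 192
```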
I’d not heard about the factor of 3; is that some particular trick for minimizing the GPU RAM cost?