Very nice! I think the original paper used a ~20GB model and ~20 CPUs (so 20 GPUs, or some small multiple of that) to train in 7 hours. How much have you shrunk it to make training doable and tolerable on Colab?
Where does the 20GB number come from? I can’t see it in a quick scan of the paper. In general the model itself isn’t that huge; in particular they really shrink the dynamics net down to a single ResNet, and it’s mostly the representation function that holds the parameters.
Despite that, training on Colab isn’t really that tolerable. In 10h of training it gets through about 100k steps, but I think that’s only a small fraction of the training the paper’s implementation does (though I don’t have numbers on how many batches they get through), so it’s very hard to tell whether it’s actually learning, or whether a change has had a positive effect.
Basically I’ve been able to use what I think is the full model size, but the quantity of training is pretty low. I’m not sure whether I’d get better performance by cutting the model size to speed up the training loop, but I suspect not, since that hasn’t been visible in CPU runs (then again, I may also just have some error in my Atari-playing code).
Ah. I was remembering something about 20GB from their GitHub, but it looks like it doesn’t correspond to model size like I thought. (I also forgot about the factor of ~3 difference between model size on disk and GPU usage, but even beyond that...)
Ah cheers, I’d not noticed that; I’ve been trying to avoid looking at it too much. The way I understood it, the DRAM usage corresponded very roughly to n_parameters * batch_size, so by adjusting the batch_size I was able to tune the memory usage easily.
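Roughly this kind of back-of-the-envelope, with placeholder numbers (the fixed and per-sample costs below are assumptions for illustration, not measured values from any run):

```python
# Toy sketch of tuning batch_size against a GPU memory budget.
# All numbers are illustrative placeholders.
fixed_gb = 4.0        # weights + gradients + optimizer state (doesn't scale with batch size)
per_sample_gb = 0.01  # activations etc. per sample in the batch (assumed roughly constant)
budget_gb = 12.0      # e.g. a typical Colab GPU

def estimated_usage(batch_size):
    """Linear memory model: fixed cost plus a per-sample term."""
    return fixed_gb + per_sample_gb * batch_size

# Largest batch size that fits the budget under this linear model.
max_batch = int((budget_gb - fixed_gb) / per_sample_gb)
print(max_batch, estimated_usage(max_batch))
```

In practice the per-sample term gets found empirically (bump the batch size until the allocator complains), which is roughly what that tuning amounts to.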
I’d not heard about the factor of 3; is that some particular trick for minimizing the GPU RAM cost?
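(For what it’s worth, one standard accounting that gives a small multiple like that: during training the GPU holds gradients and optimizer state alongside the weights, while the checkpoint on disk is just the weights. A toy illustration with a made-up parameter count:)

```python
# Rough accounting for GPU-resident size vs. checkpoint size during training.
# The parameter count is made up; the multipliers are standard fp32 bookkeeping.
n_params = 50_000_000   # hypothetical parameter count
bytes_per_param = 4     # fp32

checkpoint_gb = n_params * bytes_per_param / 1e9

# Training also keeps gradients plus optimizer state resident:
#   SGD + momentum: weights + grads + momentum buffer    -> ~3x the checkpoint
#   Adam:           weights + grads + two moment buffers -> ~4x the checkpoint
for optimizer, copies in {"sgd_momentum": 3, "adam": 4}.items():
    print(f"{optimizer}: ~{copies * checkpoint_gb:.1f} GB resident "
          f"vs {checkpoint_gb:.1f} GB on disk (before activations)")
```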