I went and checked, and as far as I can tell they used the same 1024 batch size for both the 12 and 6 hour times. The changes I noticed were better normalization, label smoothing, a somewhat tweaked input pipeline (not sure if optimization or refactoring), and updating TensorFlow a few versions (which plausibly includes a bunch of hardware optimizations like you're talking about).
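For what it's worth, label smoothing is simple to state. Here's a minimal NumPy sketch of the standard formulation (the epsilon value here is a common default, not necessarily the one they used):

```python
import numpy as np

def smooth_labels(one_hot, epsilon=0.1):
    """Label smoothing: soften hard 0/1 targets.

    The correct class gets 1 - epsilon, and the remaining epsilon is spread
    uniformly across all classes, which discourages over-confident logits.
    """
    num_classes = one_hot.shape[-1]
    return one_hot * (1.0 - epsilon) + epsilon / num_classes

# A 3-class one-hot target [0, 1, 0] becomes roughly [0.033, 0.933, 0.033]
print(smooth_labels(np.array([0.0, 1.0, 0.0]), epsilon=0.1))
```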
The things they took from fast.ai for the 2x speedup were training on progressively larger image sizes and the better triangular learning rate schedule. Separately, for their later submissions, which don't include a single-GPU figure, fast.ai came up with better methods of cropping and augmentation that improve accuracy. I don't necessarily think the pace of 2x speedups through clever ideas is sustainable; lots of the fast.ai ideas seem to be pretty low hanging fruit.
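As a rough illustration of the triangular schedule idea, here's a sketch of a plain linear ramp up and back down over training (the specific min/max rates and the halfway turning point are placeholders of mine, not the actual hyperparameters they used):

```python
def triangular_lr(step, total_steps, min_lr=0.01, max_lr=0.1):
    """Triangular learning-rate schedule: ramp linearly from min_lr to max_lr
    over the first half of training, then linearly back down over the second half."""
    half = total_steps / 2
    if step < half:
        return min_lr + (max_lr - min_lr) * (step / half)
    return max_lr - (max_lr - min_lr) * ((step - half) / half)

# Example: peak learning rate is reached at the midpoint of a 1000-step run
print(triangular_lr(500, 1000))  # ~0.1
```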
I basically agree with the quoted part of your take, just that I don't think it explains enough of the apathy towards training speed that I see, although it might more fully explain the situation at OpenAI and DeepMind. I'm making more of a revealed-preferences, efficient-markets kind of argument: the fact that this low hanging fruit wasn't picked and isn't incorporated into the vast majority of deep learning projects suggests that researchers are sufficiently unconstrained by training times that it isn't worth their time to optimize things.
Like I say in the article though, I'm not super confident, and I could be underestimating the zeal for faster training because of sampling error in what I've seen, read, and thought of, or it could just be inefficient markets.