Some of this was just using more and better hardware: the winning team used 128 V100 GPUs for 18 minutes, and 64 for 19 minutes, versus eight K80 GPUs for the baseline. However, substantial improvements were made even on the same hardware. The training time on a p3.16xlarge AWS instance with eight V100 GPUs went down from 15 hours to 3 hours in 4 months.
Was the original 15-hour time for fp16 training, or fp32?
(A factor of 5 in a few months seems plausible, but before updating on that datapoint it would be good to know if it’s just from switching to tensor cores, which would be a rather different narrative.)
I just checked and it seems it was fp32. I agree this makes it less impressive; I forgot to check that originally. I still think this somewhat counts as a software win, because getting fp16 training to work required a bunch of programmer effort to take advantage of the hardware, just like optimizing to make better use of cache would.
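(For a sense of what that effort looks like today: below is a minimal sketch of fp16/mixed-precision training using PyTorch’s torch.cuda.amp utilities. The model, optimizer, and data are placeholders for illustration; back in 2018 the loss scaling and fp32 master-weight bookkeeping largely had to be written by hand.)

```python
import torch
from torch.cuda.amp import autocast, GradScaler

# Placeholder model, loss and data -- purely illustrative.
model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()
loader = [(torch.randn(32, 1024, device="cuda"),
           torch.randint(0, 10, (32,), device="cuda"))] * 10

scaler = GradScaler()  # dynamic loss scaling keeps fp16 gradients from underflowing

for inputs, targets in loader:
    optimizer.zero_grad()
    with autocast():                   # forward pass runs in fp16 where it's safe
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()      # scale the loss, backprop through fp16 activations
    scaler.step(optimizer)             # unscale grads; skip the step if they overflowed
    scaler.update()                    # adjust the loss scale for the next iteration
```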
However, there’s also a different set of same-machine datapoints available in the benchmark, where training time on a single Cloud TPU v2 went down from 12 hours 30 minutes to 2 hours 44 minutes, a 4.5x speedup similar to the 5x achieved on the V100. The Cloud TPU is special-purpose hardware that was running bfloat16 training from the start, so that’s a similar-magnitude improvement more clearly due to software. The history shows incremental progress down to 6 hours and then a 2x speedup once the fast.ai team published their techniques and the Google Brain team incorporated them.
I think that fp32 → fp16 should give a >5x boost on a V100, so this 5x improvement still probably hides some inefficiencies when running in fp16.
I suspect the initial 12.5 → 6 hour improvement on TPUs was also mostly dealing with low-hanging fruit and cleaning up various inefficiencies from porting older code to a TPU, a larger batch size, etc. It seems plausible the last factor of 2 is more of a steady-state improvement; I don’t know.
My take on this story would be: “Hardware has been changing rapidly, giving large speedups, and at the same time people have been scaling up to larger batch sizes in order to spend more money. Each time hardware or scale changes, old software is poorly adapted, and it requires some engineering effort to make full use of the new setup.” On this reading, these speedups don’t provide as much insight into whether future progress will be driven by hardware.
I went and checked, and as far as I can tell they used the same 1024 batch size for the 12-hour and 6-hour times. The changes I noticed were better normalization, label smoothing, a somewhat tweaked input pipeline (not sure if optimization or refactoring), and updating TensorFlow a few versions (which plausibly includes a bunch of hardware optimizations like you’re talking about).
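(Of those, label smoothing is the easiest to show concretely. Here’s a minimal sketch of the usual formulation, where the target puts 1 − ε on the true class and spreads ε uniformly over all classes; ε = 0.1 is just a common default, not necessarily what their submission used.)

```python
import torch
import torch.nn.functional as F

def label_smoothed_ce(logits, targets, smoothing=0.1):
    """Cross entropy against a smoothed target distribution:
    (1 - smoothing) on the true class, smoothing spread uniformly over classes."""
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(dim=-1, index=targets.unsqueeze(-1)).squeeze(-1)
    uniform = -log_probs.mean(dim=-1)   # expected -log p under the uniform part
    return ((1.0 - smoothing) * nll + smoothing * uniform).mean()

# Tiny usage example: batch of 4 examples, 10 classes.
logits = torch.randn(4, 10)
targets = torch.randint(0, 10, (4,))
print(label_smoothed_ce(logits, targets))
```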
The things they took from fast.ai for the 2x speedup were training on progressively larger image sizes and the better triangular learning rate schedule. Separately, for their later submissions (which don’t include a single-GPU figure), fast.ai came up with better methods of cropping and augmentation that improve accuracy. I don’t necessarily think this pace of 2x speedups through clever ideas is sustainable; lots of the fast.ai ideas seem to be pretty low-hanging fruit.
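(For concreteness: the “triangular” schedule is roughly a linear ramp up to a peak learning rate followed by a linear decay, and progressive resizing is simply running the early epochs at smaller image resolutions. Below is a minimal sketch of the schedule shape; the peak LR and warmup fraction are illustrative values, not the ones used in the actual submissions.)

```python
def triangular_lr(step, total_steps, max_lr=0.4, warmup_frac=0.3, final_lr=0.0):
    """Linear warmup to max_lr, then linear decay to final_lr -- the rough shape
    of the fast.ai-style schedule.  The specific numbers here are illustrative."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    decay_progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr + (final_lr - max_lr) * decay_progress

# Usage: peek at the learning rate at a few points of a 1000-step run.
for s in (0, 150, 300, 650, 999):
    print(s, round(triangular_lr(s, 1000), 4))
```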
I basically agree with the quoted part of your take; it’s just that I don’t think it explains enough of the apathy towards training speed that I see, although it might more fully explain the situation at OpenAI and DeepMind. I’m making more of a revealed-preferences, efficient-markets kind of argument: the fact that those low-hanging fruit weren’t picked and aren’t incorporated into the vast majority of deep learning projects suggests that researchers are sufficiently unconstrained by training times that it isn’t worth their time to optimize things.
Like I say in the article, though, I’m not super confident. I could be underestimating the zeal for faster training because of sampling error in what I’ve seen, read, and thought of, or it could just be inefficient markets.