This is helpful for something I’ve been working on—thanks!
I was initially confused about how these results could fit with claims from this paper on AI chips, which emphasizes the importance of factors other than transistor density for AI-specialized chips’ performance. But on second thought, the claims seem compatible:
The paper argues that increases in transistor density have (recently) been slow enough for investment in specialized chip design to be practical. But that’s compatible with increases in transistor density still being the main driver of performance improvements (since a proportionally small boost that lasts several years could still make specialization profitable).
The paper claims that “AI[-specialized] chips are tens or even thousands of times faster and more efficient than CPUs for training and inference of AI algorithms.” But the graph in this post shows improvements of less than a thousand times since 2006. These are compatible if the remaining efficiency gains of AI-specialized chips came before 2006, which is plausible since GPUs were first released in 1999 (or maybe the “thousands of times” figure was just too high).
Yep, I think you’re right that both views are compatible. In terms of performance comparison, the architectures are quite different, so while raw floating-point performance gives you a rough idea of a device’s capabilities, performance on specific benchmarks can differ quite a bit. Optimization adds another dimension entirely: for example, NVIDIA has highly optimized DNN libraries that achieve very impressive performance (as a fraction of raw floating-point performance) on their GPU hardware. AFAIK nobody is spending that much effort (e.g. teams of engineers x several months) to optimize deep learning models on CPUs these days because it isn’t worth the return on investment.
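To make “performance as a fraction of raw floating-point performance” concrete, here is a minimal sketch of how you might estimate it for a matrix multiply: time the operation, count the floating-point operations it implies, and divide by the device’s advertised peak. The `PEAK_FLOPS` value is a placeholder assumption, not a number from the discussion above.

```python
import time
import numpy as np

# Placeholder peak throughput in FLOP/s -- substitute the figure from your
# device's spec sheet; this value is only an assumption for illustration.
PEAK_FLOPS = 10e12

n = 4096
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

start = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - start

# A dense n x n matrix multiply performs roughly 2 * n**3 floating-point operations.
achieved = 2 * n**3 / elapsed
print(f"achieved:    {achieved / 1e12:.2f} TFLOP/s")
print(f"utilization: {achieved / PEAK_FLOPS:.1%} of peak")
```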
Thanks! To make sure I’m following, does optimization help just by improving utilization?
Yeah, pretty much. If you think about mapping something like a matrix multiply onto a specific hardware device, details like how the data is laid out in memory, how effectively the cache hierarchy is used, and how efficiently data is moved around the system are important for performance.
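As a toy illustration of the memory-layout point (just Python/NumPy, not how optimized GPU libraries actually work): the two loops below do the same arithmetic, but one walks the array in the order it is stored while the other strides across it.

```python
import time
import numpy as np

n = 4096
x = np.random.rand(n, n)  # NumPy arrays are row-major (C order) by default

def timed(f):
    start = time.perf_counter()
    f()
    return time.perf_counter() - start

# Row sums read memory contiguously; column sums jump n elements at a time,
# so each access tends to land in a different cache line.
row_time = timed(lambda: [x[i, :].sum() for i in range(n)])
col_time = timed(lambda: [x[:, j].sum() for j in range(n)])

print(f"row-wise sums:    {row_time:.3f} s")
print(f"column-wise sums: {col_time:.3f} s")
```

On typical hardware the column-wise version is noticeably slower even though the arithmetic is identical, which is the kind of detail a hand-optimized library gets right everywhere.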
As a follow-up to building the model, I was looking into specialized AI hardware and I have to say that I’m very uncertain about the claimed efficiency gains. There are some parts of the AI training pipeline that could be improved with specialized hardware but others seem to be pretty close to their limits.
We intend to understand this better and publish a piece in the future but it’s currently not high on the priority list.
Also, when the comparison is against CPUs, it’s no wonder that any highly parallel hardware comes out ~1000x more efficient. So it really depends on exactly what comparison the authors used.
Sorry, I’m a bit confused. I’m interpreting the 1st and 3rd paragraphs of your response as expressing opposite opinions about the claimed efficiency gains (uncertainty and confidence, respectively), so I think I’m probably misinterpreting part of your response?
By uncertainty I mean that I really don’t know, i.e. I could imagine both very high and very low gains. I didn’t mean to express skepticism.
For the third paragraph, I guess it depends on what you count as specialized hardware. If you think GPUs are specialized hardware, then a gain of 1000x from CPUs to GPUs sounds very plausible to me. If you think GPUs are the baseline and specialized hardware means e.g. TPUs, then a 1000x gain sounds implausible to me.
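As a back-of-the-envelope illustration of how much the claimed multiple depends on the baseline (the throughput figures below are rough, assumed orders of magnitude, not numbers from the paper or from any specific device):

```python
# Rough, assumed peak throughputs in FLOP/s -- placeholders for illustration only.
naive_single_core_cpu   = 5e9    # one core, scalar code: a few GFLOP/s
optimized_multicore_cpu = 1e12   # SIMD + all cores on a modern server CPU
datacenter_gpu_fp32     = 2e13   # dense FP32 on a recent datacenter GPU

print(f"GPU vs naive single-core CPU:   ~{datacenter_gpu_fp32 / naive_single_core_cpu:.0f}x")
print(f"GPU vs optimized multicore CPU: ~{datacenter_gpu_fp32 / optimized_multicore_cpu:.0f}x")
```

With numbers like these you get “thousands of times” against an unoptimized single-core baseline but only “tens of times” against a well-optimized multicore CPU baseline, which is one way a range like “tens or even thousands” could come about.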
My original answer wasn’t that clear. Does this make more sense to you?
It does, thanks! (I had interpreted the claim in the paper as comparing e.g. TPUs to CPUs, since the quote mentions CPUs as the baseline.)