Thanks! As I understand it, you are saying (a) in general it’s not hard to get 10x-1000x speedups (as measured by flops per dollar? Or better yet, performance per dollar?) for very specific/narrow AI applications, if you design custom hardware for them, and (b) when AIs automate more of the chip design process, it’ll take less time and money to design custom hardware for stuff, so e.g. when Transformer 2.0 comes out, less than a year later there’ll be specialized hardware for it that makes it even better. Is this a fair summary?
If so, I’d be interested to hear why you said 10x-1000x, as opposed to 2x or 1.1x. Has specialized hardware given 100x improvements in performance-per-dollar in the past? For neural nets in particular?
Yes, that’s a fair summary—though in “not hard … if you design custom hardware” the second clause is doing a lot of work.
As to the magnitude of improvement: really good linear algebra libraries are ~1.5x faster than ‘just’ good ones, GPUs are a 5x-10x improvement over CPUs for deep learning, and TPUs are 15x-30x faster than Google’s previous CPU/GPU combination (this 2018 post is a good resource). Compounding those factors (~1.5 × 5-10 × 15-30) is roughly how we’ve already seen 100x-400x improvement on ML workloads just by moving naive CPU code to good but not hyper-specialised ASICs.
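As a quick sanity check on how those estimates stack, here’s a minimal sketch of the compounding arithmetic. The individual factors are the rough estimates quoted above (not measurements), and treating them as independent multipliers is an assumption:

```python
# Rough compounding of the speedup estimates above (illustrative only).
library_gain = 1.5        # really good vs. 'just' good linear algebra library
gpu_over_cpu = (5, 10)    # GPU vs. CPU for deep learning
tpu_gain = (15, 30)       # TPU vs. Google's previous CPU/GPU setup

low = library_gain * gpu_over_cpu[0] * tpu_gain[0]
high = library_gain * gpu_over_cpu[1] * tpu_gain[1]
print(f"combined speedup: ~{low:.0f}x to ~{high:.0f}x")  # ~112x to ~450x
```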
Truly application-specific hardware is a very wide reference class, but I think it’s reasonable to expect equivalent speedups for future applications. If we’re starting with something well-suited to existing accelerators like GPUs or TPUs, there’s less room for improvement; on the other hand, TPUs are designed to support a variety of network architectures, and fully customised non-reprogrammable silicon can be 100x faster or more… it’s just terribly impractical due to the costs and latency of design and production with current technology.
For example, with custom hardware you can do bubblesort in O(n) time, by adding a compare-and-swap unit between the memory cells of each pair of adjacent elements. Or with a 2D grid of these units, you can pipeline your operations and sort lists in O(1) amortized time (one finished list per step) with O(n) latency! Matching the logical structure of your chip to the dataflow of your program is beyond the scope of this article (which is “just” about physical structure), but it is also almost absurdly powerful.
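To make the hardware-bubblesort claim concrete, here’s a minimal software sketch of what that row of compare-and-swap units computes: odd-even transposition sort, the parallel form of bubble sort. Each round below corresponds to one clock tick in which every unit fires simultaneously, so n rounds suffice (the function name is just illustrative):

```python
def hardware_bubblesort(values):
    """Simulate a row of compare-and-swap units sitting between adjacent
    memory cells (odd-even transposition sort). Each round is one clock
    tick in which all units fire in parallel, so n rounds give O(n) time
    rather than O(n^2) sequential comparisons."""
    vals = list(values)
    n = len(vals)
    for round_ in range(n):
        # Even rounds: units on pairs (0,1), (2,3), ...; odd rounds: (1,2), (3,4), ...
        start = round_ % 2
        for i in range(start, n - 1, 2):  # in hardware, these all happen at once
            if vals[i] > vals[i + 1]:
                vals[i], vals[i + 1] = vals[i + 1], vals[i]
    return vals

print(hardware_bubblesort([5, 2, 9, 1, 7, 3]))  # [1, 2, 3, 5, 7, 9]
```

The 2D-grid version simply pipelines n copies of this row, so a new list can enter every tick: each list still takes O(n) steps to traverse the grid, but in steady state one sorted list comes out per tick.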