Circuit design is the main bottleneck for using field-programmable gate arrays (FPGAs). If fully-automated designs become good enough, we could see substantial gains from having optimising compilers output a gate layout rather than machine code for an xPU or specific accelerator. We already have some such compilers, and this looks like a meaningful step towards handling non-toy-scale problems with them.
The main change here wouldn’t be so much training speed—we already have TPUs etc. to accelerate current workloads, and fabricating a new design as ASICs rather than FPGA layouts takes months-to-years at scale—but rather the latency with which we can try out custom hardware for novel ML paradigms such as transformers. What will be to transformers what TPUs were to CNNs? Specifically for novel tasks, this could be a 10x-1000x speedup, and a 2x-50x speedup for existing workloads… though I understand those are bottlenecked more on data movement between nodes than on compute.
TLDR: a small step in a high-long-term-impact trend.
(Source: while I’m not a hardware specialist, I’ve worked with the PyMTL team at Cornell on verification and validation of their Python-to-Verilog-to-silicon hardware design tools, followed high-level developments in custom compute hardware for around a decade, and worked on peta-scale supercomputing for a few years.)
I think this is incorrect. You might imagine that CPU->GPU and GPU->TPU transitions were steps up a tall log-scale tech ladder, in the way that Moore’s-law doublings were, with many more steps still possible in theory. But this is not the case, because the metric these transitions were improving on was “fraction of transistors which are dedicated to useful compute” (as opposed to extracting parallelism from a serial instruction stream, or computing unnecessary low-order bits on overly-wide floating point). This metric has a hard upper limit, at 100%, and I don’t think there’s even one order of magnitude left between current utilization and that limit.
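To spell that ceiling out as arithmetic (a sketch with hypothetical utilization figures of my own, not measurements from the comment above): if a fraction u of transistors is already doing useful compute, then pushing utilization to its hard limit of 100% buys at most a 1/u speedup along this axis.

```python
# Illustrative only -- hypothetical utilization figures, not measurements.
# If a fraction `u` of transistors already does useful compute, driving
# utilization to its hard limit of 100% gives at most a 1/u speedup.
for u in (0.05, 0.15, 0.30, 0.50):
    print(f"utilization {u:.0%} -> at most {1 / u:.1f}x left on this axis")
# Anything above ~10% utilization leaves less than one order of magnitude.
```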
No, I think we mostly agree—I’d expect TPUs to be within, say, 4x of practically optimal for the things they do. The remaining ~1 OOM that I think is possible for non-novel tasks has more to do with specialisation, e.g. model-specific hardware design, and that definitely has an asymptote.
The interesting case is if we can get TPU-equivalent hardware days after designing a new architecture, instead of years after, because (IMO) 1,000x speedups over CPUs are plausible.
Thanks! As I understand it, you are saying (a) In general it’s not hard to get 10x-1000x speedups (as measured by flops per dollar? Or better yet, performance per dollar?) for very specific/narrow AI applications, if you design custom hardware for it, and (b) when AIs automate more of the chip design process, it’ll take less time and money to design custom hardware for stuff, so e.g. when Transformer 2.0 comes out, less than a year later there’ll be specialized hardware for it that makes it even better. Is this a fair summary?
If so, I’d be interested to hear why you said 10x-1000x, as opposed to 2x or 1.1x. Has specialized hardware given 100x improvements in performance-per-dollar in the past? For neural nets in particular?
Yes, that’s a fair summary—though in “not hard … if you design custom hardware” the second clause is doing a lot of work.
As to the magnitude of improvement, really good linear algebra libraries are ~1.5x faster than ‘just’ good ones, GPUs are a 5x-10x improvement over CPUs for deep learning, and TPUs a 15x-30x improvement over Google’s previous CPU/GPU combination (this 2018 post is a good resource). So we’ve already seen 100x-400x improvement on ML workloads by moving naive CPU code to good but not hyper-specialised ASICs.
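For the arithmetic behind that 100x-400x figure, here is one way to compound the quoted factors (my own reading, treating them as roughly independent, so take it as an order-of-magnitude sanity check rather than the exact derivation):

```python
# Back-of-the-envelope compounding of the factors quoted above (my reading).
libs = 1.5     # really good vs 'just' good linear algebra libraries
gpu = (5, 10)  # GPU over CPU for deep learning
tpu = (15, 30) # TPU over Google's previous CPU/GPU combination
low, high = libs * gpu[0] * tpu[0], libs * gpu[1] * tpu[1]
print(f"~{low:.0f}x to ~{high:.0f}x")  # ~112x to ~450x -- same ballpark as 100x-400x
```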
Truly application-specific hardware is a very wide reference class, but I think it’s reasonable to expect equivalent speedups for future applications. If we’re starting with something well-suited to existing accelerators like GPUs or TPUs, there’s less room for improvement; on the other hand, TPUs are designed to support a variety of network architectures, and fully customised non-reprogrammable silicon can be 100x faster or more… it’s just terribly impractical due to the costs and latency of design and production with current technology.
For example, with custom hardware you can do bubblesort in O(n) time, by adding a compare-and-swap unit between the memory cells holding each pair of adjacent elements. Or, with a 2D grid of these units, you can pipeline the operation and sort a new list every cycle: O(1) amortized time per list, with O(n) latency! Matching the logical structure of your chip to the dataflow of your program is beyond the scope of this article (which is “just” physical structure), but it is also almost absurdly powerful.
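To make the linear-chain version concrete, here is a minimal software sketch (my illustration, not code from this thread) of odd-even transposition sort, the parallel form of bubblesort that a chain of compare-and-swap units implements; in silicon, all the swaps within one phase fire in the same clock cycle, so a length-n list sorts in n cycles.

```python
# Odd-even transposition sort: the parallel bubblesort that a linear chain of
# compare-and-swap units implements. In hardware, every swap within a phase
# happens in the same clock cycle, so n phases = O(n) cycles for n elements.
def odd_even_transposition_sort(values):
    a = list(values)
    n = len(a)
    for phase in range(n):                    # n phases suffice
        for i in range(phase % 2, n - 1, 2):  # all done in parallel in silicon
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_transposition_sort([5, 2, 9, 1, 7, 3]))  # [1, 2, 3, 5, 7, 9]
```

The 2D-grid version is the pipelined form of the same idea: keep feeding a new list in every cycle and, after the initial ~n-cycle latency, a sorted list comes out every cycle.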