Yes, that’s a fair summary—though in “not hard … if you design custom hardware” the second clause is doing a lot of work.
As to the magnitude of improvement, really good linear algebra libraries are ~1.5x faster than ‘just’ good ones, GPUs are a 5x-10x improvement on CPUs for deep learning, and TPUs 15x-30x over Google’s previous CPU/GPU combination (this 2018 post is a good resource). So we’ve already seen 100x-400x improvement on ML workloads by moving naive CPU code to good but not hyper-specialised ASICs.
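Back-of-the-envelope, those ranges compound to roughly that figure, if you assume the factors simply multiply (a simplification, since the TPU number is already measured against a CPU/GPU mix). A minimal sketch of that arithmetic:

```python
# Rough compounding of the factors quoted above (illustrative only;
# treating the speedups as independent multipliers is an assumption).
library_gain = (1.5, 1.5)   # really good vs. 'just' good linear algebra libraries
gpu_gain     = (5, 10)      # GPU vs. CPU for deep learning
tpu_gain     = (15, 30)     # TPU vs. Google's previous CPU/GPU setup

low  = library_gain[0] * gpu_gain[0] * tpu_gain[0]
high = library_gain[1] * gpu_gain[1] * tpu_gain[1]
print(low, high)   # 112.5 450.0 -- in the same ballpark as 100x-400x
```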
Truly application-specific hardware is a very wide reference class, but I think it’s reasonable to expect equivalent speedups for future applications. If we’re starting with something well-suited to existing accelerators like GPUs or TPUs, there’s less room for improvement; on the other hand, TPUs are designed to support a variety of network architectures, and fully customised non-reprogrammable silicon can be 100x faster or more… it’s just terribly impractical given the cost and latency of designing and fabricating such chips with current technology.
For example, with custom hardware you can run bubblesort in O(n) time by placing a compare-and-swap unit between the memory cells of each pair of adjacent elements, so every pair is compared in parallel on each cycle. Or with a 2D grid of these units, you can pipeline your operations and sort a stream of lists at one list per cycle (amortised O(1) time each) with O(n) latency! Matching the logical structure of your chip to the dataflow of your program is beyond the scope of this article (which is “just” physical structure), but also almost absurdly powerful.
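To make the first claim concrete, here’s a minimal software sketch of that hardware idea (the function name and structure are mine, purely for illustration): odd-even transposition sort, i.e. bubblesort with a compare-and-swap unit wired between each pair of adjacent memory cells. In silicon, every unit in a phase fires simultaneously, so each loop iteration below corresponds to one clock cycle and the sort finishes in O(n) cycles.

```python
def odd_even_transposition_sort(values):
    cells = list(values)          # the per-element memory cells
    n = len(cells)
    for cycle in range(n):        # n cycles suffice to sort n elements
        # even cycles: comparators on pairs (0,1), (2,3), ...
        # odd cycles:  comparators on pairs (1,2), (3,4), ...
        start = cycle % 2
        for i in range(start, n - 1, 2):   # these all fire in parallel in hardware
            if cells[i] > cells[i + 1]:    # one compare-and-swap unit
                cells[i], cells[i + 1] = cells[i + 1], cells[i]
    return cells

print(odd_even_transposition_sort([5, 3, 8, 1, 9, 2]))  # [1, 2, 3, 5, 8, 9]
```

The software simulation still does O(n²) work in total; the point is that the hardware spends only n cycles of wall-clock time, because the inner loop collapses into one cycle of parallel comparators.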