They apply this to find hardware-specific matmuls (by adding an extra reward equal to the negative measured runtime at the terminal state) that achieve a 10-20% larger speedup than Strassen's algorithm on NVIDIA V100s and TPU v2s (saving 4%/7.5% of wall-clock time).
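As a rough illustration of that reward shaping, here is a minimal Python sketch: a candidate algorithm (a rank-R decomposition of the matmul tensor) is benchmarked on the target hardware, and the measured runtime is subtracted from the terminal reward. The function and parameter names (`apply_decomposition`, `terminal_reward`, `weight`) are hypothetical and not AlphaTensor's actual code; this only shows the general idea of folding wall-clock time into the terminal reward.

```python
import time
import numpy as np

def apply_decomposition(factors, A, B):
    """Multiply A @ B using a rank-R decomposition (U, V, W) of the matmul tensor.

    Each rank-1 term r contributes one scalar product
        m_r = (U[:, r] . vec(A)) * (V[:, r] . vec(B)),
    and the output is recovered as vec(C) = W @ m.
    """
    U, V, W = factors
    m = (U.T @ A.ravel()) * (V.T @ B.ravel())
    return (W @ m).reshape(A.shape[0], B.shape[1])

def terminal_reward(factors, base_reward, n=4, trials=100, weight=1.0):
    """Hypothetical terminal reward: base reward minus a term proportional
    to the average wall-clock time of the candidate algorithm on this machine."""
    A, B = np.random.randn(n, n), np.random.randn(n, n)
    start = time.perf_counter()
    for _ in range(trials):
        apply_decomposition(factors, A, B)
    elapsed = (time.perf_counter() - start) / trials
    return base_reward - weight * elapsed
```

Under this (assumed) setup, two decompositions with the same rank can receive different rewards if one maps better onto the target hardware, which is what pushes the search toward device-specific algorithms.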
Does this mean that the kind of matrix multiplication that is standard in machine learning can essentially be made about 4% faster, so that any machine-learning workload would speed up by at least 4% if what they found were implemented?