There are a few reasons why “FLOPS per dollar” may not be the best measure.
1) One may confuse GPU performance with AI-related hardware performance. Long-term trends in GPU performance are largely irrelevant here, since GPUs only came into use in AI research around 2012, which produced an immediate performance jump of roughly 50 times over CPUs.
2) Moreover, AI hardware jumped from classical GPUs to tensor cores starting in 2016, and this provided more than an order-of-magnitude increase in inference performance with Google’s TPU1.
(2014 NVIDIA K80: ~30 GFLOPS/watt (Smith, 2014), vs. 2016 Google TPU1: 575–2300 GFLOPS/watt, inference only, link)
In other words, slow progress in GPUs should not be taken as evidence of slow growth in AI hardware, as AI computing keeps jumping from one type of hardware to another, gaining 1–2 orders of magnitude in performance with each jump. The next jump will probably be to spiking-neural-network chips.
Also, FLOPS per dollar is the wrong measure of the total cost of ownership of AI-related hardware.
The biggest cost is not the hardware itself but electricity, data-center usage, and the salaries of human AI scientists.
3) Most contemporary TPUs are optimised for lower energy consumption, since energy is the biggest part of their cost of ownership, and the energy cost of AI-related computing is declining faster than the price of the hardware itself (in FLOPS per dollar). For example, an AI accelerator for convolutional neural networks planned for 2019 reaches 24 TeraOPS/watt (Gyrfalcon, 2018), roughly 3 orders of magnitude better than the 2014 NVIDIA K80 at 30 GFLOPS/watt.
4) One should also count the total cost of using the data center, not only the price of the kWh consumed by a GPU. “The average annual data center cost per kW ranges from $5,467 for data centers larger than 50,000 square feet to $26,495 for facilities that are between 500 and 5,000 square feet in size” (link). This is roughly 6–30 times the cost of the electricity alone, and the cost is lower for larger data centers (see the back-of-envelope sketch after this list). This means that larger computers are cheaper to own, a property that a FLOPS-per-dollar analysis cannot capture.
5) The biggest part of the cost of ownership now is AI scientists’ salaries. A team of good scientists could cost a few million dollars a year, say 5 million. This means that every additional day of computation costs tens of thousands of dollars in “idle” human time alone (also illustrated in the sketch below), so the total cost of AI research is lower when very powerful computers are used. That is why there are many attempts to scale up the computers used for AI; for example, in November 2018 Sony found a way to train ResNet-50 on a massive GPU cluster in only 4 minutes.
6) The “FLOPS per dollar” measure also cannot capture an important capability: parallelization. Early GPUs were limited in their data-exchange capabilities and could only train networks with a few million parameters within their dozen or so gigabytes of memory. Progress in software and hardware now allows thousands of GPUs to be connected into a single system and neural networks with hundreds of millions of parameters to be trained.
7) “FLOPS per dollar” also fails to capture the progress of the “intellect stack”: the vertical integration between high-level programming languages, low-level assembler-like languages, and the hardware, which is an important part of NVIDIA’s success. It includes CUDA, Keras, libraries, YouTube tutorials, you name it.
In other words, “FLOPS per dollar” is not a good proxy for AI hardware growth, and relying on it may lead to underestimating the pace of progress, and hence to overly long estimates of AI timelines.
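As a rough back-of-envelope sketch of the arithmetic behind points 4 and 5 (the $0.10/kWh electricity price and the per-calendar-day accounting are assumptions; the data-center costs and the $5M/year team figure come from the comment above):

    # Point 4: annual data-center cost per kW vs. the raw electricity bill.
    electricity_price_per_kwh = 0.10                  # assumed retail price, $/kWh
    raw_electricity_per_kw_year = electricity_price_per_kwh * 24 * 365   # ~$876 per kW-year

    for label, annual_cost_per_kw in [("large facility (>50,000 sq ft)", 5467),
                                      ("small facility (500-5,000 sq ft)", 26495)]:
        ratio = annual_cost_per_kw / raw_electricity_per_kw_year
        print(f"{label}: total cost is ~{ratio:.0f}x the raw electricity bill")
    # -> roughly 6x for the largest facilities and 30x for the smallest.

    # Point 5: the "idle" salary cost of every extra day of training.
    team_salary_per_year = 5_000_000                  # assumed team cost, $/year
    print(f"Idle-team cost per calendar day: ~${team_salary_per_year / 365:,.0f}")
    # -> ~$13,700 per calendar day (about $20,000 per working day).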
When trying to fit an exponential curve, don’t weight all the points equally. Or, if you’re using Excel and just want the easy way, take the log of your values and then fit a straight line to the logs.
Um… why?
Because the noise usually grows as the signal does. Consider Moore’s law for transistors per chip. Back when that number was about 10^4, the standard deviation was also small, say 10^3. Now that the count is around 10^8, no two chips are going to be within a thousand transistors of each other; the standard deviation is much bigger (~10^7).
This means that if you’re trying to fit the curve, being off by 10^5 is a small mistake when predicting the current transistor count, but a huge mistake when predicting a past transistor count. It’s not rare or implausible now to find a chip with 10^5 more transistors, but back in the ’70s that difference would be a huge error, impossible under an accurate model of reality.
A basic fitting function, like least squares, doesn’t take this into account. It will trade off transistors now vs. transistors in the past as if the mistakes were of exactly equal importance. To do better you have to use something like a chi-squared method, where you explicitly weight the points differently based on their variance. Or fit on a log scale using the simple method, which effectively assumes that the noise is proportional to the signal.
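To make this concrete, here is a minimal Python sketch (numpy/scipy, on synthetic data rather than the real GPU dataset) contrasting an ordinary least-squares fit of the exponential with a chi-squared-style weighted fit and with the simple fit-a-line-to-the-logs approach:

    import numpy as np
    from scipy.optimize import curve_fit

    # Synthetic Moore's-law-style data: exponential growth whose noise
    # scales with the signal (early chips vary by thousands of transistors,
    # recent ones by millions).
    rng = np.random.default_rng(0)
    years = np.arange(1971, 2020)
    true_counts = 1e4 * 2.0 ** ((years - 1971) / 2)        # doubling every ~2 years
    observed = true_counts * np.exp(0.3 * rng.standard_normal(years.size))

    def exponential(t, a, b):
        return a * np.exp(b * (t - 1971))

    # Naive: ordinary least squares on the raw counts. The huge recent values
    # dominate the loss, so the early data is essentially ignored.
    params_raw, _ = curve_fit(exponential, years, observed, p0=(1e4, 0.35))

    # Chi-squared style: give each point an error bar proportional to the
    # signal, so every era matters about equally.
    params_chi2, _ = curve_fit(exponential, years, observed, p0=(1e4, 0.35),
                               sigma=observed)

    # Simple version: fit a straight line to the logs, which assumes the same thing.
    slope, intercept = np.polyfit(years, np.log(observed), 1)
    print("doubling time from the log-space fit:", np.log(2) / slope, "years")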
That makes perfect sense. Thanks.
“When trying to fit an exponential curve, don’t weight all the points equally”
We didn’t. We fit a line in log space, but weighted the points by sqrt(y). We did that because the data doesn’t actually appear linear in log space.
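A minimal sketch of that kind of weighted fit, using numpy’s polyfit with its w argument and made-up numbers in place of the actual dataset (not the analysis code itself):

    import numpy as np

    # Stand-in numbers only: x = dates, y = FLOPS per dollar.
    x = np.array([2007.5, 2010.0, 2012.3, 2015.1, 2017.8])
    y = np.array([2e8, 9e8, 4e9, 2e10, 6e10])

    # Unweighted straight-line fit in log space (a single exponential)...
    unweighted = np.polyfit(x, np.log10(y), 1)
    # ...versus the same fit with each point weighted by sqrt(y), which pulls
    # the line toward the most recent (largest-y) points.
    weighted = np.polyfit(x, np.log10(y), 1, w=np.sqrt(y))
    print("unweighted slope:", unweighted[0], " sqrt(y)-weighted slope:", weighted[0])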
This is what it looks like if we don’t weight them. If you want to bite the bullet of this being a better fit, we can bet about it.
Interesting, thanks. This “unweighted” (on a log scale) graph looks a lot more like what I’d expect to be a good fit for a single-exponential model.
Of course, if you don’t like how an exponential curve fits the data, you can always change models—in this case, probably to a curve with 1 more free parameter (indicating a degree of slowdown of the exponential growth) or 2 more free parameters (to have 2 different exponentials stitched together at a specific point in time).
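A sketch of the second option (two exponentials stitched at a breakpoint, fit in log space with scipy’s curve_fit; the model form, parameter names, and dummy data here are illustrative, not the analysis actually used):

    import numpy as np
    from scipy.optimize import curve_fit

    def stitched_log_model(t, a, b1, b2, t0):
        # log10(FLOPS per dollar): slope b1 before the breakpoint t0, slope b2
        # after it, continuous at t0 where the value is a.
        return np.where(t < t0, a + b1 * (t - t0), a + b2 * (t - t0))

    # Dummy data standing in for log10(FLOPS per dollar) versus date.
    rng = np.random.default_rng(1)
    t = np.linspace(2007, 2019, 25)
    log_y = stitched_log_model(t, 10.5, 0.25, 0.08, 2013) + 0.1 * rng.standard_normal(t.size)

    # Four free parameters: two more than a single exponential (a, b).
    params, _ = curve_fit(stitched_log_model, t, log_y, p0=(10.0, 0.2, 0.1, 2012.0))
    print("fitted breakpoint year:", params[3])
    print("log10 slope before/after:", params[1], params[2])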
Oh that’s actually a pretty good idea. Might redo some analysis we built on top of this model using that.
Ooh, shiny dataset! This is a good complement to cpudb, which does something similar for CPUs (but whose data fields get kinda sparse after 2012).
The first release of CUDA was in 2007; prior to that, GPGPU wasn’t much of a thing. I think the extra-fast improvement from 2007 to 2012 represents the transition from game-graphics-oriented hardware to general-computing-oriented hardware.