Power in = power out, but a power limit is quite different to a thermal limit. An embedded microcontroller running off a watch battery still obeys power in = power out, but is generally only limited by how much power you can put in, not its thermals.
This doesn’t make sense to me—in what sense are they not thermally limited? Nvidia could not viably put out a consumer GPU that used 3 kilowatts for example.
This is the wrong angle to look at this question. Efficiency is a curve. At the point desktop GPUs sit at, large changes to power result in much smaller changes to performance. Doubling the power into a top end desktop GPU would not increase its performance by anywhere near double, and similarly halving the power only marginally reduces the performance.
It is true that devices are thermally limited in the sense that they could run faster if they had more power, but because of the steep efficiency curve, this is not at all the same as saying that they could not productively use more transistors, nor does it directly correspond to dark silicon in a meaningful way. The power level is a balance between this performance increase and the cost of the power draw (which includes things like the cost of the power supplies and heatsink). As the slope of power needed per unit of extra performance effectively approaches infinity, you will always find that the optimal trade-off is below theoretical peak performance.
If you add more transistors without improving those transistors’ power efficiency, and without improving power extraction, you can initially just run that larger number of transistors at a more efficient point on the power-performance curve.
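To make the shape of that curve concrete, here is a minimal toy sketch (an illustration, not measured GPU data): throughput is assumed to scale with clock frequency and power roughly with its cube, i.e. the classic C·V²·f dynamic-power relation with voltage tracking frequency.

```python
# Toy power-performance model: perf ~ f, power ~ f^3 (voltage rising roughly with f).
# Illustrative only; real GPU curves differ in detail but share the qualitative shape.

def perf(f):
    return f           # throughput scales with clock at fixed architecture

def power(f):
    return f ** 3      # P ~ C * V^2 * f, with V ~ f

# Doubling power into the same chip buys only ~26% more performance:
print(perf(2 ** (1 / 3)))        # ~1.26

# Halving power costs only ~21% of performance:
print(perf(0.5 ** (1 / 3)))      # ~0.79

# Spending the same power budget on 2x the transistors, each clocked lower:
# 2 * f^3 = 1  =>  f = 0.5^(1/3), total throughput = 2 * f
print(2 * 0.5 ** (1 / 3))        # ~1.59x performance at unchanged power
```

Under this toy model, the last line is the sense in which extra transistors can be spent on moving to a more efficient operating point rather than on more heat.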
A 2x density scaling without a 2x energy efficiency scaling just results in a 2x higher dark silicon ratio—this is already the case, and is why Nvidia’s recent GPU dies are increasingly split into specialized components: FP/int, tensor core, ray tracing, etc.
This is not true. GPUs can run shader cores and RT cores at the same time, for example. The reason for dedicated hardware for AI and ray tracing is that dedicated hardware is significantly more efficient (both per transistor and per watt) at doing those tasks.
I’m not sure what you mean here—from what I recall, the flip/J metrics of the low-power/mobile process nodes show gains on the order of 25% or so, not 100%. Phones/laptops have smaller processor dies and more dark silicon, not dramatically more efficient transistors.
The point isn’t the logic cell, those tend to be marginal improvements as you say. The point is that those products are operating at a much more efficient point on the power-performance curve. Laptop NVIDIA GPUs are identical dies to their desktop dies (though not always to the same model number; a 3080 Mobile is a desktop 3070 Ti, not a desktop 3080). Phone GPUs are much more efficient again than laptop GPUs.
It is true that a phone SoC has more dark silicon than a dedicated GPU, but this is just because phone SoCs do a lot of disparate tasks, each of which is individually optimized for. Their GPUs are not particularly darker than other GPUs, and GPUs in general are not darker than their construction requires.
It should also be noted that dark silicon is not the same as wasted silicon.
$1/day of electricity (at $0.15/kWh) + $1/day for cooling (1:1 is a reasonable rule of thumb, but obviously depends on environment)
Note that Google claims ~10:1.
I’m not convinced mining is a good proxy here (their market is weird), but it sounds like you agree that power is a significant but lesser cost.
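For reference, a back-of-envelope check on these figures (the $0.15/kWh rate and the 1:1 vs ~10:1 cooling ratios come from the exchange above; the rest is plain arithmetic):

```python
# What $1/day of electricity at $0.15/kWh implies for average draw,
# and what the two cooling rules of thumb imply for cooling cost.

electricity_per_day = 1.00      # $/day, as stated above
price_per_kwh = 0.15            # $/kWh, as stated above

kwh_per_day = electricity_per_day / price_per_kwh
avg_watts = kwh_per_day / 24 * 1000
print(round(kwh_per_day, 2), round(avg_watts))   # ~6.67 kWh/day, ~278 W average draw

# Cooling cost under each rule of thumb:
print(electricity_per_day * 1.0)   # 1:1  -> ~$1.00/day
print(electricity_per_day * 0.1)   # 10:1 -> ~$0.10/day (roughly a PUE of ~1.1)
```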
This is the wrong angle to look at this question. Efficiency is a curve. At the point desktop GPUs sit at, large changes to power result in much smaller changes to performance. Doubling the power into a top end desktop GPU would not increase its performance by anywhere near double, and similarly halving the power only marginally reduces the performance.
Are you talking about clock rates? Those haven’t changed for GPUs in a while; I’m assuming they will remain essentially fixed. Doubling the power into a desktop GPU at a fixed clock rate (and ignoring the dark silicon fraction) thus corresponds to doubling the transistor count (at the same transistor energy efficiency), which would double performance, power, and thermal draw all together.
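A minimal sketch of the fixed-clock scaling described above, using the standard α·C·V²·f switching-power estimate (all constants here are illustrative placeholders, not measurements):

```python
# At fixed clock and voltage, switching power and throughput both scale linearly
# with the number of active transistors, so doubling the transistor count
# doubles performance, power, and heat together.

def dynamic_power(n_transistors, cap_per_transistor_farads, voltage_v, freq_hz, activity=0.1):
    """alpha * C * V^2 * f, summed over switching transistors."""
    return activity * n_transistors * cap_per_transistor_farads * voltage_v ** 2 * freq_hz

base    = dynamic_power(1e9, 1e-15, 0.9, 1.8e9)   # ~146 W with these placeholder values
doubled = dynamic_power(2e9, 1e-15, 0.9, 1.8e9)
print(doubled / base)                             # 2.0: power doubles with transistor count
```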
This is not true. GPUs can run shader cores and RT cores at the same time, for example. The reason for dedicated hardware for AI and ray tracing is that dedicated hardware is significantly more efficient (both per transistor and per watt) at doing those tasks.
Jensen explicitly mentioned dark silicon as a motivator in some presentation about the new separate FP/int paths in Ampere, and I’m assuming the same probably applies at some level internally for the many paths inside tensor cores and RT cores. I am less certain about perf/power when simultaneously maxing tensor cores + RT cores + ALU cores + memory paths, but I’m guessing it would thermal limit and underclock to some degree.
The point is that those products are operating at a much more efficient point on the power-performance curve. Laptop NVIDIA GPUs are identical dies to their desktop dies (though not always to the same model number; a 3080 Mobile is a desktop 3070 Ti, not a desktop 3080).
Primarily through lowered clock rates or dark silicon. I ignored clock rates because they seem irrelevant for the future of Moore’s law.
Google has unusually efficient data-centers, but I’d also bet that efficiency measure isn’t for a pure GPU datacenter, which would have dramatically higher energy density, and thus greater cooling challenges, than their typical light-CPU, storage-heavy, search-optimized servers.
Clock rate is relevant. Or rather, the underlying aspects that in part determine clock rate are relevant. It is true that doubling transistor density while holding all else equal would require much more thermal output, but it’s not the only option, were thermal constraints the dominant factor.
I agree there is only so much room to be gained here, which would quickly vanish in the face of exponential trends, but this part of our debate came up in the context of whether current GPUs are already past this point. I claim they aren’t, and that being so far past the point of maximal energy efficiency is evidence of it.
Jensen explicitly mentioned dark silicon as a motivator in some presentation about the new separate FP/int paths in Ampere
This doesn’t make sense technically; if anything Ampere moves in the opposite direction, by making both datapaths able to do FP simultaneously (though this is ultimately a mild effect that isn’t really relevant). To quote the GA102 whitepaper:
Most graphics workloads are composed of 32-bit floating point (FP32) operations. The Streaming Multiprocessor (SM) in the Ampere GA10x GPU Architecture has been designed to support double-speed processing for FP32 operations. In the Turing generation, each of the four SM processing blocks (also called partitions) had two primary datapaths, but only one of the two could process FP32 operations. The other datapath was limited to integer operations. GA10x includes FP32 processing on both datapaths, doubling the peak processing rate for FP32 operations. As a result, GeForce RTX 3090 delivers over 35 FP32 TFLOPS, an improvement of over 2x compared to Turing GPUs.
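As a rough sanity check on the quoted figure, using the publicly listed RTX 3090 shader count and boost clock (these specs are assumed here, not part of the excerpt):

```python
# Peak FP32 throughput = cores * 2 FLOPs per clock (one fused multiply-add) * boost clock.

cuda_cores = 10496           # RTX 3090 (GA102)
boost_clock_hz = 1.695e9     # ~1.70 GHz advertised boost
flops_per_core_per_clock = 2

tflops = cuda_cores * flops_per_core_per_clock * boost_clock_hz / 1e12
print(round(tflops, 1))      # ~35.6 TFLOPS peak FP32, consistent with "over 35 FP32 TFLOPS"
```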
I briefly looked for the source for your comment and didn’t find it.
Google has unusually efficient data-centers
We are interested in the compute frontier, so this is still relevant. I don’t share the intuition that higher energy density would make cooling massively less efficient.
I was aware the 3090 had 2x FP32, but I thought that dual FP thing was specific to the GA102. Actually the GA102 just has 2x the ALU cores per SM vs the GA100.
We are interested in the compute frontier, so this is still relevant. I don’t share the intuition that higher energy density would make cooling massively less efficient.
There are efficiency transitions from passive to active, air to liquid, etc, that all depend on energy density.