Ok, that might be a crux. I am claiming that new GPU designs are already energy limited; that is the main constraint GPU engineers care about.
I agree this seems to be our main departure.
You seem to be conflating two limits: power limits, as in how much energy we can put into a system, and thermal limits, as in how much energy we can extract from that system to cool it down.
With regard to thermal limits, GPUs run fairly far into the diminishing returns of their power-performance curve, and pushing them further, even with liquid nitrogen, doesn’t help by a disproportionate amount. NVIDIA is pushing significantly more power into their top-end GPUs than they need to get close to peak performance. Compare phone to laptop to desktop GPUs; efficiency/transistor improves drastically as power goes down. So it seems to me like GPUs are not yet thermally limited, in the sense that having more transistor density would still allow performance scaling even in the absence of those transistors becoming more efficient.
Arguably this could be a result of architectural trade-offs prioritizing mobile, but flagships sell cards, so if NVIDIA is willing to give those cards that much power, it should be optimizing them for that power level. I’d also expect this to show up as a greater advantage for competitors that target servers specifically, which we don’t see. Anyhow, this isn’t a physical limit: there exist much better ways to extract heat than the ones we currently use, if that were something scaling required.
You seem mostly concerned with the other point, the power limit, specifically the limit that derives from the price of that power. My understanding is that power is a significant fraction of server costs, but still significantly less than the amortized cost of the hardware.
You seem to be conflating two limits: power limits, as in how much energy we can put into a system, and thermal limits, as in how much energy we can extract from that system to cool it down.
I didn’t use the word thermal, but of course the two are trivially related, as power in = heat out for irreversible computers, so power limit and thermal limit can be used interchangeably in that sense. GPUs (and any processor, really) have a power/thermal design limit based on what’s commercially feasible to support, both in terms of the power supply and the required cooling.
So it seems to me like GPUs are not yet thermally limited, in the sense that having more transistor density would still allow performance scaling even in the absence of those transistors becoming more efficient.
This doesn’t make sense to me—in what sense are they not thermally limited? Nvidia could not viably put out a consumer GPU that used 3 kilowatts for example. The RTX 3090 pushing power draw up to 350 watts was a big deal. Enterprise GPUs are even more power constrained, if anything (the flagship A100 uses 250 watts, although I believe it’s on a slightly better TSMC node rather than Samsung), and they are also enormously more expensive per flop.
A 2x density scaling without a 2x energy efficiency scaling just results in a 2x higher dark silicon ratio; this is already the case, and it’s why Nvidia’s recent GPU dies are increasingly split into specialized components: FP/int, tensorcore, ray tracing, etc.
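A back-of-the-envelope sketch of that claim, under the simplifying assumptions of a fixed package power budget and fixed per-block switching power; the specific numbers are placeholders chosen so the baseline die can light everything at once, not real GPU figures:

```python
# Dark silicon fraction under a fixed power budget, with per-block switching power
# held constant (illustrative placeholders, not measured GPU numbers).

power_budget = 350.0     # W, roughly an RTX 3090-class power limit
power_per_block = 1.0    # W per transistor block (placeholder)
baseline_blocks = 350.0  # chosen so the baseline die can run everything simultaneously

def active_fraction(n_blocks: float) -> float:
    """Fraction of the die that can switch simultaneously within the power budget."""
    return min(1.0, power_budget / (n_blocks * power_per_block))

for density in (1, 2, 4):
    frac = active_fraction(baseline_blocks * density)
    print(f"{density}x density, same efficiency: {frac:.0%} active, {1 - frac:.0%} dark")
```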
Compare phone to laptop to desktop GPUs; efficiency/transistor improves drastically as power goes down.
I’m not sure what you mean here—from what I recall the flop/J metrics of the low power/mobile process nodes are on the order of 25% gains or so, not 100%. Phones/laptops have smaller processor dies and more dark silicon, not dramatically more efficient transistors.
My understanding is that power is a significant fraction of server costs, but still significantly less than the amortized cost of the hardware.
That naturally depends on the age of the hardware—eventually it will become useless when its power + maintenance cost (which is also mostly power/thermal driven) exceeds its value.
For example, for a 3090 right now the base mining value (and thus market rate) is about $8/day, for about $1/day of electricity (at $0.15/kWh) + $1/day for cooling (1:1 is a reasonable rule of thumb, but obviously depends on environment), so power/thermal is about 25%, vs. say 10% discount rate and 65% depreciation. Whereas it’s more like 50/50 for an older 1080 Ti.
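A minimal sketch of that arithmetic, using the figures above; the ~280 W average draw is an assumption backed out from the $1/day electricity figure, not a measurement:

```python
# Rough cost split for the 3090 mining example; the dollar figures are the ones
# quoted above, and the average draw is an assumption implied by them.

mining_revenue = 8.00       # $/day, approximate 3090 mining value / market rate
electricity_price = 0.15    # $/kWh
avg_draw_kw = 0.28          # assumed ~280 W average draw
electricity = avg_draw_kw * 24 * electricity_price  # ~$1/day
cooling = electricity * 1.0                         # 1:1 cooling rule of thumb

power_thermal = electricity + cooling               # ~$2/day
capital = mining_revenue - power_thermal            # depreciation + discount

print(f"electricity ${electricity:.2f}/day, cooling ${cooling:.2f}/day")
print(f"power/thermal share: {power_thermal / mining_revenue:.0%}")  # ~25%
print(f"capital share:       {capital / mining_revenue:.0%}")        # ~75% (65% depreciation + 10% discount)
```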
Power in = power out, but a power limit is quite different to a thermal limit. An embedded microcontroller running off a watch battery still obeys power in = power out, but is generally only limited by how much power you can put in, not its thermals.
This doesn’t make sense to me—in what sense are they not thermally limited? Nvidia could not viably put out a consumer GPU that used 3 kilowatts for example.
This is the wrong angle to look at this question. Efficiency is a curve. At the point desktop GPUs sit at, large changes to power result in much smaller changes to performance. Doubling the power into a top end desktop GPU would not increase its performance by anywhere near double, and similarly halving the power only marginally reduces the performance.
It is true that devices are thermally limited in the sense that they could run faster if they had more power, but because of the steep efficiency curve, this is not at all the same as saying that they could not productively use more transistors, nor does it directly correspond to dark silicon in a meaningful way. The power level is a balance between this performance increase and the cost of the power draw (which includes things like the cost of the power supplies and heatsink). As the slope of power needed per unit of extra performance effectively approaches infinity, you will always find that the optimal trade-off is below theoretical peak performance.
If you add more transistors without improving those transistors’ power efficiency, and without improving heat extraction, you can initially just run that greater number of transistors at a more optimal power ratio.
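A toy model of that trade-off, assuming the usual first-order scaling rules: performance ∝ transistors × clock, dynamic power ∝ transistors × f·V² with voltage roughly tracking frequency, so power ∝ N·f³. The exponents are idealized assumptions, not fitted to any real GPU:

```python
# Toy model: at a fixed power budget P, with power ~ N * f^3, the sustainable
# clock is f ~ (P / N)^(1/3) and performance ~ N * f.

def perf_at_power_budget(n_transistors: float, power_budget: float) -> float:
    """Relative performance achievable with n_transistors under power_budget."""
    f = (power_budget / n_transistors) ** (1 / 3)  # highest clock the budget allows
    return n_transistors * f

baseline = perf_at_power_budget(n_transistors=1.0, power_budget=1.0)
wider = perf_at_power_budget(n_transistors=2.0, power_budget=1.0)
hotter = perf_at_power_budget(n_transistors=1.0, power_budget=2.0)

print(f"2x transistors, same power: {wider / baseline:.2f}x performance")   # ~1.59x
print(f"same die, 2x power:         {hotter / baseline:.2f}x performance")  # ~1.26x
```

Under those assumptions, doubling the power into a fixed die buys only ~26% more performance, while doubling the transistor count at the same power budget buys ~59%, which is the sense in which extra density still pays off even without more efficient transistors.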
A 2x density scaling without a 2x energy efficiency scaling just results in a 2x higher dark silicon ratio; this is already the case, and it’s why Nvidia’s recent GPU dies are increasingly split into specialized components: FP/int, tensorcore, ray tracing, etc.
This is not true. GPUs can run shader cores and RT cores at the same time, for example. The reason for dedicated hardware for AI and ray tracing is that dedicated hardware is significantly more efficient (both per transistor and per watt) at doing those tasks.
I’m not sure what you mean here—from what I recall the flop/J metrics of the low power/mobile process nodes are on the order of 25% gains or so, not 100%. Phones/laptops have smaller processor dies and more dark silicon, not dramatically more efficient transistors.
The point isn’t the logic cell; those gains tend to be marginal, as you say. The point is that those products are operating at a much more efficient point on the power-performance curve. Laptop NVIDIA GPUs use the same dies as their desktop counterparts (though not always under the same model number; a 3080 Mobile is a desktop 3070 Ti, not a desktop 3080). Phone GPUs are much more efficient again than laptop GPUs.
It is true that a phone SoC has more dark silicon than a dedicated GPU, but this is just because phone SoCs handle a lot of disparate tasks, each with its own individually optimized hardware. Their GPUs are not particularly more dark than other GPUs, and GPUs in general are not particularly more dark than necessary for their construction.
It should also be noted that dark silicon is not the same as wasted silicon.
$1/day of electricity (at $0.15/kWh) + $1/day for cooling (1:1 is a reasonable rule of thumb, but obviously depends on environment)
Note that Google claims ~10:1.
I’m not convinced mining is a good proxy here; their market is weird. But it sounds like you agree that power is a significant but lesser cost.
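For scale, a quick comparison of the two rules of thumb applied to the $1/day electricity figure, assuming the ~10:1 ratio refers to compute power versus cooling/overhead power (roughly a PUE of 1.1, which is about what Google reports fleet-wide):

```python
# Cooling cost under the two rules of thumb, applied to the $1/day electricity figure.
electricity = 1.00  # $/day, from the 3090 example

for label, cooling_ratio in [("1:1 rule of thumb", 1.0), ("~10:1 (PUE ~1.1)", 0.1)]:
    cooling = electricity * cooling_ratio
    print(f"{label}: cooling ${cooling:.2f}/day, power + cooling ${electricity + cooling:.2f}/day")
```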
This is the wrong angle to look at this question. Efficiency is a curve. At the point desktop GPUs sit at, large changes to power result in much smaller changes to performance. Doubling the power into a top end desktop GPU would not increase its performance by anywhere near double, and similarly halving the power only marginally reduces the performance.
Are you talking about clock rates? Those haven’t changed for GPUs in a while, and I’m assuming they will remain essentially fixed. Doubling the power into a desktop GPU at a fixed clock rate (and ignoring the dark silicon fraction) thus corresponds to doubling the transistor count (at the same transistor energy efficiency), which would double performance, power, and thermal draw all together.
This is not true. GPUs can run shader cores and RT cores at the same time, for example. The reason for dedicated hardware for AI and ray tracing is that dedicated hardware is significantly more efficient (both per transistor and per watt) at doing those tasks.
Jensen explicitly mentioned dark silicon as a motivator in some presentation about the new separate FP/int paths in Ampere, and I’m assuming the same probably applies at some level internally for the many paths inside tensorcores and RT cores. I am less certain about perf/power for simultaneously maxing tensorcores+RTcores+alucores+mempaths, but I’m guessing it would hit the thermal limit and underclock to some degree.
The point is that those products are operating at a much more efficient point on the power-performance curve. Laptop NVIDIA GPUs use the same dies as their desktop counterparts (though not always under the same model number; a 3080 Mobile is a desktop 3070 Ti, not a desktop 3080).
Primarily through lowered clock rates or dark silicon. I ignored clock rates because they seem irrelevant for the future of Moore’s law.
Google has unusually efficient datacenters, but I’d also bet that efficiency measure isn’t for a pure GPU datacenter, which would have dramatically higher energy density, and thus bigger cooling challenges, than their typical light-CPU, heavy-storage, search-optimized servers.
Clock rate is relevant. Or rather, the underlying factors that partly determine clock rate are relevant. It is true that doubling transistor density while holding all else equal would mean much more thermal output, but it’s not the only option, were thermal constraints the dominant factor.
I agree there is only so much room to be gained here, and it would quickly vanish in the face of exponential trends, but this part of our debate came up in the context of whether current GPUs are already past this point. I claim they aren’t, and the fact that they are run so far past the point of maximal energy efficiency is evidence of that.
Jensen explicitly mentioned dark silicon as a motivator in some presentation about the new separate FP/int paths in Ampere
This doesn’t make sense technically; if anything Ampere moves in the opposite direction, by making both datapaths able to do FP simultaneously (though this is ultimately a mild effect that isn’t really relevant). To quote the GA102 whitepaper:
Most graphics workloads are composed of 32-bit floating point (FP32) operations. The Streaming Multiprocessor (SM) in the Ampere GA10x GPU Architecture has been designed to support double-speed processing for FP32 operations. In the Turing generation, each of the four SM processing blocks (also called partitions) had two primary datapaths, but only one of the two could process FP32 operations. The other datapath was limited to integer operations. GA10x includes FP32 processing on both datapaths, doubling the peak processing rate for FP32 operations. As a result, GeForce RTX 3090 delivers over 35 FP32 TFLOPS, an improvement of over 2x compared to Turing GPUs.
I briefly looked for the source for your comment and didn’t find it.
Google has unusually efficient datacenters
We are interested in the compute frontier, so this is still relevant. I don’t share the intuition that higher energy density would make cooling massively less efficient.
I was aware the 3090 had 2x FP32, but I thought that dual FP thing was specific to the GA102. Actually the GA102 just has 2x the ALU cores per SM vs the GA100.
We are interested in the compute frontier, so this is still relevant. I don’t share the intuition that higher energy density would make cooling massively less efficient.
There are efficiency transitions from passive to active, air to liquid, etc., that all depend on energy density.