My first and second impressions on reading this are that I want to bet against you, but I’m not even quite clear on what specific bet you are taking against Jensen/IRDS/myself when you say:
The natural implication is that device scaling has already stalled and will soon hit a wall, that scaling out much further is uneconomical, and in conclusion that AI progress cannot be driven much further through scaling, certainly not soon, and possibly not ever.
I disagree with this view. My argument is structured into a few key points.
Because you are hedging bets there, and then also here:
I want to emphasize here, these laws set a baseline expectation for future progress. A history of false alarms should give you some caution when you hear another alarm without qualitatively better justification. This does not mean Moore’s Law will not end; it will. This does not even mean it won’t end soon, or suddenly; it very well might.
So what I’m wondering is: what is your more exact distribution over Moore’s Law? To be specific, what is your distribution over the future graph of ops/$ or ops/J, such that it even disagrees with the mainstream (Jensen/IRDS/myself/etc.)?
To hold myself to that same standard, I predict that for standard available GPUs/TPUs/etc. (irreversible parallel von Neumann machines), about a 65% chance we can squeeze about 10x more ops/J out by 2028 (Moravec’s prediction of AGI), and only about a 10% chance we can squeeze out about 100x more ops/J.
Do you disagree? I believe ops/$ will be mostly dominated by ops/J.
The wildcard is neuromorphic computing, which could allow somewhat better-than-brain (say ~10x) noisy analog ops/J. But that’s a separate discussion, and those chips won’t run current DL well; they are mostly only good for more explicitly brain-like AGI.
To hold myself to that same standard, I predict that for standard available GPUs/TPUs/etc. (irreversible parallel von Neumann machines), about a 65% chance we can squeeze about 10x more ops/J out by 2028 (Moravec’s prediction of AGI), and only about a 10% chance we can squeeze out about 100x more ops/J.
2028 is 6 years and change away. Even a straight-line extrapolation of transistor density wouldn’t quite make a 10x improvement versus today’s cutting edge, and that scales better than switches-per-joule. So if we’re ignoring the device architecture, I think I’m more pessimal than you!
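A rough back-of-envelope for that extrapolation (assuming an optimistic 2-year density doubling, which is an illustrative assumption rather than a roadmap figure):

```python
# Straight-line extrapolation of transistor density to 2028.
# Assumes a 2-year doubling cadence (optimistic; roughly the fastest
# sustained historical rate), purely for illustration.
years_to_2028 = 6
doubling_period = 2  # years per density doubling (assumed)
density_gain = 2 ** (years_to_2028 / doubling_period)
print(f"~{density_gain:.0f}x density by 2028")  # ~8x, short of 10x
```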
I don’t address ops/J in the article, though I respond to the question here. It seems totally reasonable to me that compute is eventually limited by energy production. At the same time, we are not currently anywhere near the limits of how much power we could feasibly pump into (or extract from) any one given supercomputer, and at minimum we have some power scaling left to expect from the roadmap.
You’re right to call out the hedging in my article, but it is legitimate uncertainty. I expect progress to about match IRDS until 2028, but predictions have been wrong, and I didn’t want people to take Moore’s Law’s seeming historic inviolability as evidence that it actually is inviolable.
To try to clarify and enumerate the relevant stances from the article,
1. The reports of Moore’s Law’s death have been greatly exaggerated, as it applies to current and historical trends.
2. I expect business as usual until at least around when the IRDS stops expecting it, i.e. 2028, after which IRDS expects scaling to come from 3D stacking of transistors.
3. AI performance will grow about proportionally to the product of transistor density and frequency, notwithstanding major computer architecture changes.
4. Some memory technology will displace traditional DRAM, likely this decade, with much better scaling properties. Plausibly several will.
5. You will see other forms of scaling, like 3D integration, continually make progress, though I’m not staking a claim on any given exponential rate.
6. Scaling up will happen proportionally to the money spent on compute, in the sense that we will not reach the point where we are physics limited, rather than resource limited, in how big AI systems can be.
7. I give some examples of feasible systems much larger and more capable than today’s.
If any of these don’t match what you got from the article, please point it out and I’ll try to fix the discrepancy.
I don’t address ops/J in the article, though I respond to the question here. It seems totally reasonable to me that compute is eventually limited by energy production.
Ok that might be a crux. I am claiming that new GPU designs are already energy limited; that is the main constraint GPU engineers care about.
I will update my off-the-cuff prediction with something more calibrated for posterity (that was an initial zero-effort guess), but I’m not ignoring device architecture. For the 2028 timeframe it’s more like only ~2x ops/J from semiconductor process improvements (when measuring, say, transistor flips/J), and ~5x ops/J from low-level architecture improvements in low-precision matrix multiply units (or say ~5x and ~20x for my lower-probability estimate). I’m specifically talking about GPU/TPU-style processors, not neuromorphic, as described earlier (in part because I believe GPU/TPU will take us to AGI before neuromorphic matters). Much more of the pre-neuromorphic gain will come from software.
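A quick sketch of how those ballpark factors multiply out to the earlier headline numbers (rough figures for illustration only):

```python
# How the component estimates combine multiplicatively (rough ballparks
# from the text, not measurements).
process_gain, arch_gain = 2, 5   # the ~65%-likely scenario
process_hi, arch_hi = 5, 20      # the lower-probability (~10%) scenario
print(process_gain * arch_gain)  # ~10x total ops/J
print(process_hi * arch_hi)      # ~100x total ops/J
```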
I believe 1.) is actually easy to estimate from physics; I’ve read the physics/ECE papers outlining the exact end of Moore’s Law, and I’m assuming Jensen et al. have as well (and have deeper inside knowledge). The main constraint is more the transit (interconnect) energy than the transistor flip energy.
2.) Doesn’t actually extend Moore’s Law (at least by the useful definitions I’m using).
3.) GPUs aren’t limited by transistor count, they are limited by power—i.e. we are already well into the ‘Dark Silicon’ era.
4.) This is already priced in, and doesn’t help enough.
5.) Doesn’t help logic enough because of power/heat issues, but it’s already important and priced in for RAM (e.g. HBM).
6.) I mean that’s an independent scaling axis—you can always spend more on compute, and we probably have more OOMs of slack there? Orthogonal to the Moore’s Law predictions.
7.) I’ll reply to those feasible system examples separately after looking more closely.
Ok that might be a crux. I am claiming that new GPU designs are already energy limited; that is the main constraint GPU engineers care about.
I agree this seems to be our main departure.
You seem to be conflating two limits, power limits, as in how much energy we can put into a system, and thermal limits, as in how much energy can we extract from that system to cool it down.
With regards to thermal limits, GPUs run fairly far into the diminishing returns of their power-performance curve, and pushing them further, even with liquid nitrogen, doesn’t help by a disproportionate amount. NVIDIA is pushing significantly more power into their top-end GPUs than they need to get approximately their peak performance. Compare phone to laptop to desktop GPUs; efficiency/transistor improves drastically as power goes down. So it seems to me like GPUs are not yet thermally limited, in the sense that having more transistor density would still allow performance scaling even without those transistors becoming more efficient.
Arguably this could be a result of architectural trade-offs prioritizing mobile, but flagships sell cards, so if NVIDIA is willing to give those cards so much power, they should be optimizing them to consume that much power. I’d also expect that to pan out as a greater advantage for competitors that target servers specifically, which we don’t see. Anyhow, this isn’t a physical limit, as there exist much better ways to extract heat than we are currently using, were this something that scaling required.
You seem mostly concerned with the other point, the power limit, specifically the limits derived from the price of that power. My understanding is that power is a significant fraction of server costs, but still significantly less than the amortized cost of the hardware.
You seem to be conflating two limits, power limits, as in how much energy we can put into a system, and thermal limits, as in how much energy can we extract from that system to cool it down.
I didn’t use the word thermal, but of course they are trivially related, as power in = heat out for irreversible computers, so ‘power limit’ and ‘thermal limit’ can be used interchangeably in that sense. GPUs (and any processor, really) have a power/thermal design limit based on what’s commercially feasible to support, both in terms of the power supply and the required cooling.
So it seems to me like GPUs are not yet thermally limited, in the sense that having more transistor density would still allow performance scaling even without those transistors becoming more efficient.
This doesn’t make sense to me—in what sense are they not thermally limited? Nvidia could not viably put out a consumer GPU that used 3 kilowatts for example. The RTX 3090 pushing power draw up to 350 watts was a big deal. Enterprise GPUs are even more power constrained, if anything (the flagship A100 uses 250 watts—although I believe it’s using a slightly better TSMC node rather than Samsung), and they are also enormously more expensive per flop.
A 2x density scaling without a 2x energy efficiency scaling just results in 2x higher dark silicon ratio—this is already the case and why nvidia’s recent GPU dies are increasingly split into specialized components: FP/int, tensorcore, ray tracing, etc.
Compare phone to laptop to desktop GPUs; efficiency/transistor improves drastically as power goes down.
I’m not sure what you mean here—from what I recall the flip/J metrics of the low power/mobile process nodes are on the order of 25% gains or so, not 100%. Phones/laptops have smaller processor dies and more dark silicon, not dramatically more efficient transistors.
My understanding is that power is a significant fraction of server costs, but still significantly less than the amortized cost of the hardware.
That naturally depends on the age of the hardware—eventually it will become useless when its power + maintenance cost (which is also mostly power/thermal driven) exceeds its value.
For example—for a 3090 right now the base mining value (and thus market rate) is about $8/day, for about $1/day of electricity (at $0.15/kWh) + $1/day for cooling (1:1 is a reasonable rule of thumb, but obviously depends on environment), so power/thermal is about 25%, vs say 10% discount rate and 65% depreciation. Whereas it’s more like 50/50 for an older 1080 Ti.
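A minimal worked version of that arithmetic, using the same numbers as above (the $0.15/kWh price and the 1:1 cooling rule of thumb are the stated assumptions):

```python
# 3090 mining economics from the example above.
value_per_day = 8.00        # $/day base mining value (market rate)
electricity_per_day = 1.00  # $/day at $0.15/kWh
cooling_per_day = 1.00      # $/day, using the ~1:1 cooling rule of thumb
share = (electricity_per_day + cooling_per_day) / value_per_day
print(f"power/thermal share of value: {share:.0%}")  # ~25%
# The remaining ~75% splits into roughly 10% discount rate and 65% depreciation.
```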
Power in = power out, but a power limit is quite different to a thermal limit. An embedded microcontroller running off a watch battery still obeys power in = power out, but is generally only limited by how much power you can put in, not its thermals.
This doesn’t make sense to me—in what sense are they not thermally limited? Nvidia could not viably put out a consumer GPU that used 3 kilowatts for example.
This is the wrong angle to look at this question. Efficiency is a curve. At the point desktop GPUs sit at, large changes to power result in much smaller changes to performance. Doubling the power into a top end desktop GPU would not increase its performance by anywhere near double, and similarly halving the power only marginally reduces the performance.
It is true that devices are thermally limited in the sense that they could run faster if they had more power, but because of the steep efficiency curve, this is not at all the same as saying that they could not productively use more transistors, nor does it directly correspond to dark silicon in a meaningful way. The power level is a balance between this performance increase and the cost of the power draw (which includes things like the cost of the power supplies and heatsink). As the slope of power needed per unit of extra performance effectively approaches infinity, you will always find that the optimal trade-off is below theoretical peak performance.
If you add more transistors without improving those transistors’ power efficiency, and without improving power extraction, you can initially just run that greater number of transistors at a more efficient point on the power curve.
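A toy model of that trade-off, assuming the textbook dynamic-power scaling (per-transistor power roughly proportional to V²·f, with f roughly proportional to V near the operating point); this is an illustration, not a fitted GPU curve:

```python
# Toy DVFS model: performance ~ N * f, total power ~ N * f^3
# (since per-transistor power ~ V^2 * f and f ~ V). Illustration only.
def perf_at_fixed_power(n_transistors, power_budget):
    f = (power_budget / n_transistors) ** (1 / 3)  # clock allowed by the budget
    return n_transistors * f

base = perf_at_fixed_power(1.0, 1.0)
print(perf_at_fixed_power(2.0, 1.0) / base)  # ~1.59x: 2x transistors at the same power
print(perf_at_fixed_power(1.0, 2.0) / base)  # ~1.26x: 2x power into the same transistors
```

On this toy model, adding transistors and running them slower beats adding power into the same transistors, which is the sense in which density still helps even without efficiency gains.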
A 2x density scaling without a 2x energy efficiency scaling just results in 2x higher dark silicon ratio—this is already the case and why nvidia’s recent GPU dies are increasingly split into specialized components: FP/int, tensorcore, ray tracing, etc.
This is not true. GPUs can run shader cores and RT cores at the same time, for example. The reason for dedicated hardware for AI and ray tracing is that dedicated hardware is significantly more efficient (both per transistor and per watt) at doing those tasks.
I’m not sure what you mean here—from what I recall the flip/J metrics of the low power/mobile process nodes are on the order of 25% gains or so, not 100%. Phones/laptops have smaller processor dies and more dark silicon, not dramatically more efficient transistors.
The point isn’t the logic cell, those tend to be marginal improvements as you say. The point is that those products are operating at a much more efficient point on the power-performance curve. Laptop NVIDIA GPUs are identical dies to their desktop dies (though not always to the same model number; a 3080 Mobile is a desktop 3070 Ti, not a desktop 3080). Phone GPUs are much more efficient again than laptop GPUs.
It is true that a phone SoC has more dark silicon than a dedicated GPU, but this is just because phone SoCs do a lot of disparate tasks, which are individually optimized for. Their GPUs are not particularly more dark than other GPUs, and GPUs in general are not particularly more dark than necessary for their construction.
It should also be noted that dark silicon is not the same as wasted silicon.
$1/day of electricity (at $0.15/kWh) + $1/day for cooling (1:1 is a reasonable rule of thumb, but obviously depends on environment)
Note that Google claims ~10:1.
I’m not convinced mining is a good proxy here; their market is weird. But it sounds like you agree that power is a significant but lesser cost.
This is the wrong angle to look at this question. Efficiency is a curve. At the point desktop GPUs sit at, large changes to power result in much smaller changes to performance. Doubling the power into a top end desktop GPU would not increase its performance by anywhere near double, and similarly halving the power only marginally reduces the performance.
Are you talking about clock rates? Those haven’t changed for GPUs in a while, I’m assuming they will remain essentially fixed. Doubling the power into a desktop GPU at fixed clock rate (and ignoring dark silicon fraction) thus corresponds to doubling the transistor count (at the same transistor energy efficiency), which would double performance, power, and thermal draw all together.
This is not true. GPUs can run shader cores and RT cores at the same time, for example. The reason for dedicated hardware for AI and ray tracing is that dedicated hardware is significantly more efficient (both per transistor and per watt) at doing those tasks.
Jensen explicitly mentioned dark silicon as a motivator in some presentation about the new separate FP/int paths in Ampere, and I’m assuming the same probably applies at some level internally for the many paths inside tensorcores and RT cores. I am less certain about perf/power for simultaneously maxing tensorcores+RTcores+alucores+mempaths, but I’m guessing it would thermal limit and underclock to some degree.
The point is that those products are operating at a much more efficient point on the power-performance curve. Laptop NVIDIA GPUs are identical dies to their desktop dies (though not always to the same model number; a 3080 Mobile is a desktop 3070 Ti, not a desktop 3080).
Primarily through lowered clock rates or dark silicon. I ignored clock rates because they seem irrelevant for the future of Moore’s law.
Google has unusually efficient data-centers, but I’d also bet that efficiency measure isn’t for a pure GPU datacenter, which would have dramatically higher energy density, and thus cooling challenges, than their typical light-CPU, storage-heavy, search-optimized servers.
Clock rate is relevant. Or rather, the underlying aspects that in part determine clock rate are relevant. It is true that doubling transistor density while holding all else equal would require dissipating much more heat, but that’s not the only option, were thermal constraints the dominant factor.
I agree there is only so much room to be gained here, which would quickly vanish in the face of exponential trends, but this part of our debate came up in the context of whether current GPUs are already past this point. I claim they aren’t, and that being so far past the point of maximal energy efficiency is evidence of it.
Jensen explicitly mentioned dark silicon as a motivator in some presentation about the new separate FP/int paths in Ampere
This doesn’t make sense technically; if anything Ampere moves in the opposite direction, by making both datapaths be able to do FP simultaneously (though this is ultimately a mild effect that isn’t really relevant). To quote the GA102 whitepaper,
Most graphics workloads are composed of 32-bit floating point (FP32) operations. The Streaming Multiprocessor (SM) in the Ampere GA10x GPU Architecture has been designed to support double-speed processing for FP32 operations. In the Turing generation, each of the four SM processing blocks (also called partitions) had two primary datapaths, but only one of the two could process FP32 operations. The other datapath was limited to integer operations. GA10x includes FP32 processing on both datapaths, doubling the peak processing rate for FP32 operations. As a result, GeForce RTX 3090 delivers over 35 FP32 TFLOPS, an improvement of over 2x compared to Turing GPUs.
I briefly looked for the source for your comment and didn’t find it.
Google has unusually efficient data-centers
We are interested in the compute frontier, so this is still relevant. I don’t share the intuition that higher energy density would make cooling massively less efficient.
I was aware the 3090 had 2x FP32, but I thought that dual FP thing was specific to the GA102. Actually the GA102 just has 2x the ALU cores per SM vs the GA100.
We are interested in the compute frontier, so this is still relevant. I don’t share the intuition that higher energy density would make cooling massively less efficient.
There are efficiency transitions from passive to active, air to liquid, etc, that all depend on energy density.