Part of the issue is that my post/comment was about Moore’s law (transistor density for mass-produced nodes), which is a major input to, but distinct from, flops/$. As I mentioned somewhere, there is still some free optimization energy in extracting more flops/$ at the circuit level even if Moore’s law ends. Moore’s law is very specifically about fab efficiency as measured in transistors/cm^2 for large chip runs, not the flops/$ habryka wanted to bet on. Even when Moore’s law is over, I expect some continued progress in flops/$.
All that being said, Nvidia’s new flagship GPU that everyone is using (the H100, which is replacing the A100 and launched just a bit after habryka proposed the bet) actually offers near-zero improvement in flops/$ (the price increased in direct proportion to the flops increase). So I probably should have taken the bet if it was narrowly defined as flops/$ for the flagship GPUs most teams are currently using to train foundation models.
Thanks Jacob. I’ve been reading the back-and-forth between you and other commenters (not just habryka) in both this post and your brain efficiency writeup, and it’s confusing to me why some folks so confidently dismiss energy efficiency considerations with handwavy arguments not backed by BOTECs.
While I have your attention – do you have a view on how far we are from ops/J physical limits? Your analysis suggests we’re only 1-2 OOMs away from the ~10^-15 J/op limit, and if I’m not misapplying Koomey’s law (2x every 2.5y back in 2015, I’ll assume slowdown to 3y doubling by now) this suggests we’re only 10-20 years away, which sounds awfully near, albeit incidentally in the ballpark of most AGI timelines (yours, Metaculus etc).
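For reference, the arithmetic behind the 10-20 year figure can be sketched as below. This is only a rough sketch that assumes a constant doubling time for ops/J over the whole period; the function name `years_to_close_gap` is made up for illustration.

```python
import math

def years_to_close_gap(ooms, doubling_years):
    """Years needed to gain `ooms` orders of magnitude in ops/J,
    assuming efficiency doubles every `doubling_years` (held constant)."""
    doublings = ooms * math.log2(10)  # one OOM is about 3.32 doublings
    return doublings * doubling_years

# Gap of 1-2 OOMs, with the 2015-era 2.5y doubling and the assumed 3y slowdown.
for ooms in (1, 2):
    for doubling_years in (2.5, 3.0):
        years = years_to_close_gap(ooms, doubling_years)
        print(f"{ooms} OOM at {doubling_years}y doubling: {years:.1f} years")
```

With a 3y doubling time this gives roughly 10 years for 1 OOM and roughly 20 years for 2 OOMs; the 2.5y doubling time shortens both by about a sixth.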
TSMC 4N is a little over 1e10 transistors/cm^2 for GPUs, with roughly 5e-18 J switch energy assuming dense activity (little dark silicon). The practical transistor density limit with minimal few-electron transistors is somewhere around 5e11 transistors/cm^2, but the minimal viable high-speed switching energy is around 2e-18 J. So there is another 1 to 2 OOM of density scaling left, but much less room for further switching energy reduction. Thus scaling past this point increasingly involves dark silicon or complex, expensive cooling, with diminishing returns either way.
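A quick sketch of the headroom implied by those numbers, treating the quoted figures as point estimates (they are rough, and 4N is "a little over" 1e10, so the density headroom is slightly overstated here). The variable and function names are just for illustration.

```python
import math

# Rough figures quoted above, treated as point estimates.
current_density = 1e10    # transistors/cm^2, TSMC 4N (actually a little over this)
limit_density = 5e11      # transistors/cm^2, practical few-electron limit
current_switch_j = 5e-18  # J per switch today, dense activity
limit_switch_j = 2e-18    # J per switch, minimal viable at high speed

def headroom_ooms(better, worse):
    """Remaining improvement factor expressed in orders of magnitude."""
    return math.log10(better / worse)

density_headroom = headroom_ooms(limit_density, current_density)
energy_headroom = headroom_ooms(current_switch_j, limit_switch_j)

print(f"density headroom:       {density_headroom:.2f} OOM")
print(f"switch energy headroom: {energy_headroom:.2f} OOM")
```

That works out to about 1.7 OOM of remaining density scaling against only about 0.4 OOM of switching energy reduction. This mismatch is why packing in more transistors past this point means either leaving more of them dark or paying for heavier cooling.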
Achieving 1e-15 J/flop seems doable now for low-precision flops (fp4, perhaps fp8 with some tricks/tradeoffs); most of the cost is data movement, as pulling even a single bit from RAM just 1 cm away costs around 1e-12 J.
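To make the data movement point concrete, here is a small sketch comparing the two figures above. The break-even framing (how many flops must reuse each fetched bit before compute energy matches fetch energy) is my own illustrative assumption, not something stated above, and the names are hypothetical.

```python
# Rough figures quoted above.
flop_energy_j = 1e-15       # J per low-precision flop (target)
ram_bit_energy_j = 1e-12    # J to pull one bit from RAM about 1 cm away

# How many flops cost as much energy as fetching a single bit.
flops_per_bit_fetch = ram_bit_energy_j / flop_energy_j

def movement_share(flops_per_bit):
    """Fraction of total energy spent on data movement when each
    fetched bit is reused for `flops_per_bit` flops (illustrative model)."""
    compute = flops_per_bit * flop_energy_j
    return ram_bit_energy_j / (ram_bit_energy_j + compute)

print(f"one bit fetch costs about {flops_per_bit_fetch:.0f} flops of energy")
for reuse in (1, 100, 1000, 10000):
    print(f"{reuse:>6} flops per fetched bit: {movement_share(reuse):.1%} of energy is data movement")
```

Under this simple model, each bit pulled from RAM has to feed on the order of a thousand flops before compute and data movement cost the same, so the 1e-15 J/flop figure only holds for workloads with very high on-chip reuse.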