I think I’m understanding where you’re coming from a bit more now, thanks. So, when I wrote:
The H100, taken as a whole, is on the order of a million times away from the Landauer limit at its operating temperature.
My intended meaning in context was “taking the asspull as an assumption, the abstract computational thing an H100 is doing that is relevant to ML (without caring about the hardware used to accomplish it, and implicitly assuming a move to more ML-optimized architectures) is very roughly 6 OOMs off the absolute lower bound, while granting that the lower bound is not achievable due to the spherical-cow-violating details like error rates and not-just-logic and the rest.”
I gather it sounded to you more like, “we can make a GPU with a similar architecture a million times more energy efficient through Moore-like advancements.”
I’ll see if I can come up with some edits that keep it concise while being clearer.
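(For reference, the absolute lower bound I'm gesturing at is the Landauer limit, kT·ln 2 per irreversible bit erasure. A minimal sketch of that number, with ~350 K taken as an assumed, illustrative operating temperature rather than a measured H100 figure:)

```python
from math import log

k_B = 1.380649e-23   # Boltzmann constant, J/K
T = 350.0            # assumed operating temperature in kelvin (illustrative only)
print(k_B * T * log(2))  # ~3.35e-21 J per irreversible bit erasure
```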
That said, I am dubious about the predicted CMOS scaling endpoint implying a 4090 is only about 2-3x away from minimal switching+interconnect costs. That’s very hard to square with the fact that the 4090 ships with extreme clock rates and supporting voltages to meet the expectations of a halo gaming product. Due to the nonlinear curves involved, I wouldn’t be surprised if a 4090 underclocked and undervolted to its efficiency sweet spot is very close to, or even below, the predicted minimum. (Something like a 6700 XT on TSMC 7 nm at 1500 MHz is ~2.5x more efficient per clock than at 2600 MHz.)
Here’s an attempt with Apple’s M1 Ultra, on a similar N5 process:
Total draw: ~180W (60W CPU + 120W GPU)
Transistor count: 114B
GPU clock: 1.3 GHz
E/P core maximum frequency: 2.064 GHz / 3.228 GHz
In the absence of good numbers for the CPU/GPU transistor split, let’s assume it’s similar to the ratio between a 7950X (13.1B) and a 4080 12GB (35.8B), or around 27% CPU. Assuming all CPU cores are running at the conservative E-core maximum frequency of 2.064 GHz:
CPU: 60 / (0.5 * 0.27 * 114e9 * 2.064e9) ≈ 1.89e-18 J per switch
GPU: 120 / (0.5 * 0.73 * 114e9 * 1.3e9) ≈ 2.21e-18 J per switch
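Here’s the same arithmetic as a quick Python sketch; the 0.5 switching factor, the 27%/73% transistor split, and the clock assumptions are the guesses above, not measured values:

```python
def energy_per_switch(power_w, transistor_share, n_transistors, freq_hz, activity=0.5):
    """Very rough joules per transistor switch: power / (activity * share * transistors * frequency)."""
    return power_w / (activity * transistor_share * n_transistors * freq_hz)

# M1 Ultra, with the assumed 60 W CPU / 120 W GPU split and ~27% of the 114B transistors in the CPU
print(energy_per_switch(60, 0.27, 114e9, 2.064e9))   # ~1.89e-18 J (CPU at E-core max clock)
print(energy_per_switch(120, 0.73, 114e9, 1.3e9))    # ~2.21e-18 J (GPU at 1.3 GHz)
```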
It is more apparent on low-power systems like phones. The iPhone 14 Pro uses an Apple A16 with 2 P cores at a maximum of 3.46 GHz, 4 E cores at 2.02 GHz, and a 5-core GPU at an unstated maximum frequency, manufactured on the N4 process. It has 16B transistors. Getting good numbers for the transistor split, frequency, and power consumption is even trickier here, but a good estimate for power seems to be a maximum of around 3.6W.
For simplicity, I’ll assume the whole chip runs at only 1 GHz, which is likely around where the GPU would run (and I doubt the SoC would ever be permitted to max out all frequencies at once for any significant period of time; if it did, power draw would probably be a bit higher than 3.6W). Repeating the exercise:
3.6 / (0.5 * 16e9 * 1e9) ≈ 4.5e-19 J per switch
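Or, reusing the energy_per_switch helper above (the flat 1 GHz clock and the 3.6W figure are the rough assumptions stated here, not measurements):

```python
# A16: treat the whole 16B-transistor chip as one pool at an assumed 1 GHz
print(energy_per_switch(3.6, 1.0, 16e9, 1e9))  # ~4.5e-19 J
```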
I could be missing something or making some simple mistake here, but it seems like even pretty conservative estimates suggest the ~2e-18 J bound isn’t correct. One way out would be that the amount of dark silicon is actually much higher than 50%, but that would in turn suggest there’s more headroom.
The results do fit with my expectations, though: if we were extremely close to a wall, I’d expect to see doubling times that are much, much longer (to the point of ‘doubling times’ being a silly way to even talk about progress). Instead, indications are that N3 will probably give another bump in efficiency and density fairly similar to that of N5 over N7, or N7 over N10. And, while trusting corpo-PR is always a little iffy, they do appear confident in N2 continuing the trend for power efficiency. (It is true the density increase from N3E to N2 is smaller for their first iteration on the node.)
(To be clear, again, I don’t think we can get another million times improvement in switching energy or overall efficiency in irreversible computers. It is going to slow down pretty hard, pretty soon, more than it already has, unless crazy miracles happen. Just not enough for it to matter for AGI, in my expectation.)
Hmm, actually the 0.5 would assume fully bright silicon, all 100% in use, because active transistors only switch about half the time on average. So really it should be 0.5*a, where a is some activity factor, and I do think we are entering the dark silicon era to some degree. Consider the Nvidia tensor cores and all the different bit pathways they have. Those may share some sub-parts, but it seems unlikely they share everything.
Also, CPUs tend to be mostly SRAM cache, which has a much lower activity level.
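As a quick illustration of how much this matters, here is the A16 estimate from above with the 0.5 replaced by 0.5*a for a few hypothetical activity factors (the specific values of a are made up for illustration, not measured):

```python
# Implied joules per actual switch for the A16 numbers above, at assumed activity factors
power_w, n_transistors, freq_hz = 3.6, 16e9, 1e9
for a in (1.0, 0.5, 0.2):
    print(a, power_w / (0.5 * a * n_transistors * freq_hz))
# a=1.0 -> 4.5e-19 J, a=0.5 -> 9.0e-19 J, a=0.2 -> ~2.25e-18 J
```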