Out of curiosity:
What rough probability do you assign to a 10x improvement in efficiency for ML tasks (GPU or not) within 20 years?
What rough probability do you assign to a 100x improvement in efficiency for ML tasks (GPU or not) within 20 years?
My understanding is that we actually agree about the important parts of hardware, at least to the degree I think this question is even relevant to AGI at this point. I think we may disagree about the software side, but I'm not sure.
I do agree I left a lot out of the hardware limits analysis, but largely because I don’t think it is enough to move the needle on the final conclusion (and the post is already pretty long!).
So assuming by ‘efficiency’ you mean training perf per $, then:
95% (Hopper/Lovelace will already provide 2x to 4x)
65%
Looks like we’re in almost perfect agreement!
I agree with you that we may already have enough compute, but I called this out mostly because it struck me as quick/sloppy overconfident analysis (or perhaps we just disagree on the physics), which distracted from your other arguments.
Scanning through your other post, I don't think we disagree on the physics regarding ML-relevant compute. It is a quick and simplistic analysis, yes; my intent there was really just to say "hardware bottlenecks sure don't look like they're going to arrive soon enough to matter, given the rest of this stuff." The exact amount of headroom we have left, and everything that goes into that estimation, just didn't seem worth including given the post's length and the section's low impact. (I would have chosen differently if those details changed the conclusion of the section.)
I am curious as to what part felt overconfident to you. I attempted to lampshade the nature of the calculations with stuff like “napkin math” and “asspull,” but there may be some other phrasing that indicated undue certainty.
I have gone back and forth about the value of the section; it's one of the least important for the actual argument, but it seemed worth having a brief blurb. It's possible that I just don't quite understand the vibe you're getting from it.
For example, in your original comment:
there are good reasons why the semiconductor roadmap has ended and the perception in industry is that Moore's Law is finally approaching its end.
I was a little confused by this, because it sounds like my post made you think that I expect Moore's law to continue unhindered, or that there are no massive problems ahead for semiconductor manufacturing in the next 20 years. In reality, I agree: that set of technologies is in the latter stages of its sigmoid. (See, for example, the Q&A entry about me underplaying the slowdown in Moore's law.)
If there’s some misleading wording somewhere that I can fix easily, I’d like to.
Yeah, it was the asspull part, which I mostly noticed as Landauer, and this:
The H100, taken as a whole, is on the order of a million times away from the Landauer limit at its operating temperature.
Well, instead of using the asspull math, you can look at the analysis in the engineering literature. At a really high level, you can just look at the end of the ITRS roadmap. The scaling physics for CMOS are reasonably well understood, and the endpoint has been known for a decade. A good reference is this, which lists a minimal transition energy of around 6e-19J, and a minimal switch energy of around 2e-18J (after including local interconnect), for the end of CMOS scaling. The transition energy of around 6e-19J is a few OOM larger than the minimal Landauer bound, but that bound only applies to computations that take infinite time and/or have a useless failure rate of 50%. For reliable digital logic, the minimal energy is closer to an electronvolt, or ~1.6e-19J (which is why chip voltages are roughly 1V, whereas neurons compute semi-reliably at just a few times the minimal Landauer voltage).
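(A quick sanity check of those ratios, as a minimal Python sketch. The 350K figure is just an assumed ballpark for a hot chip's operating temperature, not a number from the reference; the other constants are the ones quoted above.)

```python
import math

k_B = 1.380649e-23  # Boltzmann constant, J/K

def landauer_bound(temp_k):
    """Minimum energy to erase one bit: k_B * T * ln(2)."""
    return k_B * temp_k * math.log(2)

E_landauer = landauer_bound(350)   # ~350K: assumed chip operating temperature
E_transition = 6e-19               # end-of-CMOS minimal transition energy (quoted above)
E_reliable = 1.6e-19               # ~1 electronvolt, rough floor for reliable digital logic

print(f"Landauer bound @350K:  {E_landauer:.2e} J")                 # ~3.3e-21 J
print(f"transition / Landauer: {E_transition / E_landauer:.0f}x")   # ~180x, a couple OOM
print(f"1 eV / Landauer:       {E_reliable / E_landauer:.0f}x")     # a few dozen x
```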
So then if we do a very rough calculation for the upcoming RTX 4090, assuming a 50% transistor activity rate, we get:
(450W / (0.5 * 7.6e10 * 2.2e9)) = 5.3e-18J, so only a few times above the predicted end-of-CMOS scaling energy, not a million times above. This is probably why TSMC's future nodes are all just an N3 with some new letter, why Jensen Huang (Nvidia's CEO) says Moore's law is dead, etc. (Intel, meanwhile, says it's not dead yet, but they are 4 or 5 years behind TSMC, so it's only true for them.)
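(The same back-of-envelope estimate as a small Python helper; the function name is mine, and the 50% activity factor and ~2.2GHz boost clock are the assumptions stated above, not measured values.)

```python
def energy_per_switch(power_w, transistors, clock_hz, activity=0.5):
    """Rough average energy per switching event:
    total power / (active transistors * switches per second)."""
    return power_w / (activity * transistors * clock_hz)

# RTX 4090 numbers used above: 450W, 7.6e10 transistors, ~2.2GHz boost clock
e_4090 = energy_per_switch(450, 7.6e10, 2.2e9)
print(f"RTX 4090: {e_4090:.1e} J/switch")  # ~5.4e-18 J, a few x the ~2e-18 J end-of-CMOS estimate
```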
Now maybe there will be future miracles, but they seem to buy at best only a few OOM, which is about the remaining gap to the brain, and the brain really is pushing at the energy limit.
I think I'm understanding where you're coming from a bit more now, thanks. So, when I wrote:
The H100, taken as a whole, is on the order of a million times away from the Landauer limit at its operating temperature.
My intended meaning in context was “taking the asspull as an assumption, the abstract computational thing an H100 is doing that is relevant to ML (without caring about the hardware used to accomplish it, and implicitly assuming a move to more ML-optimized architectures) is very roughly 6 OOMs off the absolute lower bound, while granting that the lower bound is not achievable due to the spherical-cow violating details like error rates and not-just-logic and the rest.”
I gather it sounded to you more like, “we can make a GPU with a similar architecture a million times more energy efficient through Moore-like advancements.”
I’ll see if I can come up with some edits that keep it concise while being clearer.
That said, I am dubious that the predicted CMOS scaling endpoint implies a 4090 is only about 2-3x away from minimal switching+interconnect costs. That's very hard to square with the fact that the 4090 is shipping with extreme clock rates and supporting voltages to meet the expectations of a halo gaming product. Due to the nonlinear curves involved, I wouldn't be surprised if a 4090 underclocked and undervolted to its efficiency sweet spot were very close to, or even below, the predicted minimum. (Something like a 6700 XT on TSMC 7nm at 1500MHz is ~2.5x more efficient per clock than at 2600MHz.)
Here's an attempt with Apple's M1 Ultra, on a similar N5 process:
Total draw: ~180W (60W CPU + 120W GPU)
Transistor count: 114B
GPU clock: 1.3GHz
E/P core maximum frequency: 2.064GHz/3.228GHz
In the absence of good numbers for the CPU/GPU transistor split, let's assume it's similar to the difference between a 7950X (13.1B) and a 4080 12GB (35.8B), or around 27% CPU. Assuming all CPU cores are running at the conservative E core maximum frequency of 2.064GHz:
CPU: 60W / (0.5 * 0.27 * 114e9 * 2.064e9) = 1.89e-18 J
GPU: 120W / (0.5 * 0.73 * 114e9 * 1.3e9) = 2.21e-18 J
The effect is even more apparent on low-power systems like phones. The iPhone 14 Pro uses an Apple A16, manufactured on the N4 process, with 2 P cores at a maximum of 3.46GHz, 4 E cores at 2.02GHz, and a 5-core GPU at an unstated max frequency. It has 16B transistors. Getting good numbers for the transistor split, frequency, and power consumption is even trickier here, but a good estimate for power seems to be a maximum of around 3.6W.
For simplicity, I'll assume the whole chip runs at only 1GHz, which is likely around where the GPU would run (and I doubt the SoC would ever be permitted to max out all frequencies at once for any significant period of time; if it did, the power draw would probably be a bit higher than 3.6W). Repeating the exercise:
3.6W / (0.5 * 16e9 * 1e9) = 4.5e-19 J
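(The same sketch applied to both chips above; the 27% CPU transistor share, the uniform 1GHz assumption for the A16, and the 50% activity factor are just the rough assumptions already stated, so treat the outputs as order-of-magnitude estimates only.)

```python
def energy_per_switch(power_w, transistors, clock_hz, activity=0.5):
    """Average energy per switching event: power / (active transistors * clock)."""
    return power_w / (activity * transistors * clock_hz)

# M1 Ultra: 114B transistors, assumed ~27% CPU / ~73% GPU split
m1_cpu = energy_per_switch(60, 0.27 * 114e9, 2.064e9)   # E-core max clock
m1_gpu = energy_per_switch(120, 0.73 * 114e9, 1.3e9)    # GPU clock

# A16 (iPhone 14 Pro): 16B transistors, whole chip assumed at 1GHz, ~3.6W
a16 = energy_per_switch(3.6, 16e9, 1e9)

print(f"M1 Ultra CPU:    {m1_cpu:.2e} J")  # ~1.9e-18 J
print(f"M1 Ultra GPU:    {m1_gpu:.2e} J")  # ~2.2e-18 J
print(f"A16 (whole SoC): {a16:.2e} J")     # ~4.5e-19 J
```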
I could be missing something or making some simple mistake here, but it seems like even pretty conservative estimates suggest the ~2e-18 bound isn’t correct. One way out would be that the amount of dark silicon is actually much higher than 50%, but that would in turn suggest there’s more headroom.
The results do fit with my expectations, though; if we were extremely close to a wall, I'd expect to see doubling times that are much, much longer (to the point where 'doubling times' would be a silly way to even talk about progress). Instead, indications are that N3 will probably give another bump in efficiency and density fairly similar to that of N5 over N7, or N7 over N10. And, while trusting corpo-PR is always a little iffy, TSMC does appear confident in N2 continuing the trend for power efficiency. (It is true that the density increase from N3E to N2 is smaller for their first iteration on the node.)
(To be clear, again, I don’t think we can get another million times improvement in switching energy or overall efficiency in irreversible computers. It is going to slow down pretty hard, pretty soon, more than it already has, unless crazy miracles happen. Just not enough for it to matter for AGI, in my expectation.)
Hmm, actually the 0.5 would assume fully bright silicon, with 100% of transistors in use, because transistors only switch about half the time on average. So really it should be 0.5*a, where a is some activity factor, and I do think we are entering the dark silicon era to some degree. Consider the Nvidia tensor cores and all the different bit-width pathways they have. Those may share some sub-parts, but it seems unlikely they share everything.
Also, CPUs tend to be mostly SRAM cache, which has a much lower activity level.
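(To illustrate how much the activity factor matters: the 4090 estimate from above rescaled for a few hypothetical values of a. None of these activity numbers are measured; they just show how directly the result depends on that assumption.)

```python
# How the RTX 4090 energy-per-switch estimate shifts with the activity factor 'a'
power_w, transistors, clock_hz = 450, 7.6e10, 2.2e9

for a in (1.0, 0.5, 0.2, 0.1):  # hypothetical fractions of bright (active) silicon
    e = power_w / (0.5 * a * transistors * clock_hz)  # 0.5 = average switching rate of active transistors
    print(f"a = {a:>4}: {e:.1e} J/switch")  # 5.4e-18 at a=1.0, up to 5.4e-17 at a=0.1
```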