I’m going to make this slightly more legible, but not contribute new information.
Note that downthread, Jacob says:
the temp/size scaling part is not one of the more core claims so any correction there probably doesn’t change the conclusion much.
So if your interest is in Jacob’s arguments as they pertain to AI safety, this chunk of Jacob’s writings is probably not key for your understanding and you may want to focus your attention on other aspects.
Both Jacob and John agree on the obvious fact that active cooling is necessary for both the brain and GPUs, and is a crucial aspect of their design.
Jacob:
Humans have evolved exceptional heat dissipation capability using the entire skin surface for evaporative cooling: a key adaption that supports both our exceptional long distance running ability, and our oversized brains...
Current 2021 gpus have a power density approaching 10⁶ W/m², which severely constrains the design to that of a thin 2D surface to allow for massive cooling through large heatsinks and fans...
John:
… brains, chips, and probably future compute substrates are all cooled by conduction through direct contact with something cooler (blood for the brain, heatsink/air for a chip)..
On top of that, we could of course just use a colder environment, i.e. pump liquid nitrogen or even liquid helium over the thing. According to this meta-analysis, the average temperature delta between e.g. brain and blood is at most ~2.5 C, so even liquid nitrogen would be enough to achieve ~100x larger temperature delta if the system were at the same temperature as the brain; we don’t even need to go to liquid helium for that.
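For a rough sense of that ~100x figure (my arithmetic, not John's, assuming the brain and blood sit near body temperature, ~310 K, and liquid nitrogen at its boiling point, ~77 K):
$$\frac{310\ \mathrm{K} - 77\ \mathrm{K}}{2.5\ \mathrm{K}} \approx 93$$
so the claimed ~100x is just this ratio of temperature deltas.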
Where they disagree is on two points:
Whether temperature of GPUs/brains scales with their surface area
Tractability of dealing with higher temperatures in scaled-down computers with active cooling
Jacob applies the Stefan-Boltzmann Law for black body radiators. In this model, the equilibrium temperature depends on both power and surface area, via the power radiated per unit surface area:
$$T = \left(\frac{M_e}{\sigma}\right)^{1/4}$$
where $M_e$ is the power per unit surface area in W/m², and $\sigma$ is the Stefan-Boltzmann constant.
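For a concrete sense of scale (my arithmetic, not from either post): plugging Jacob's ~10⁶ W/m² GPU power density into this formula gives the temperature such a surface would have to reach to shed its heat by radiation alone:
$$T = \left(\frac{10^{6}\ \mathrm{W/m^2}}{5.67\times 10^{-8}\ \mathrm{W\,m^{-2}\,K^{-4}}}\right)^{1/4} \approx 2050\ \mathrm{K}$$
which matches Jacob's framing below: this is the temperature the computing element would reach without convective cooling.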
In comments, he rationalizes this choice by saying:
SB law describes the relationship to power density of a surface and corresponding temperature; it just gives you an idea of the equivalent temperature sans active cooling… That section was admittedly cut a little short, if I had more time/length it would justify a deeper dive into the physics of cooling and how much of a constraint that could be on the brain. You’re right though that the surface power density already describes what matters for cooling.
And downthread, he says:
I (and the refs I linked) use that as a starting point to indicate the temp the computing element would achieve without convective cooling (ie in vacuum or outer space).
John advocates an alternative formula for heat flow:
Put all that together, and a more sensible formula would be:
$$\frac{q}{A} = C_1 \frac{(T_S - T_E)\,R}{R^2} = C_2 \frac{T_S - T_E}{R}$$
… where:
$R$ is radius of the system
$A$ is surface area of thermal contact
$q$ is heat flow out of system
$T_S$ is system temperature
$T_E$ is environment temperature (e.g. blood or heat sink temperature)
$C_1, C_2$ are constants with respect to system size and temperature
One factor of R cancels out. I’m also going to move A over to the other side, ignore the constants (and, for our conceptual purposes, the leftover factor of 1/R), and cut out the middle part of the equation, leaving us with:
$$q \propto A\,(T_S - T_E)$$
In language, the heat flow out of the brain/GPU and into its cooling system (i.e. blood, a heatsink) is proportional to (area of contact) x (temperature difference).
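Rearranged (my restatement, not John's), the same relation says that the temperature difference the system has to sustain is proportional to its surface power density:
$$T_S - T_E \propto \frac{q}{A}$$
which is the same quantity, surface power density, that Jacob's own argument centers on.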
At first glance, this would appear to also show that as you scale down, heat flow out of the system will decrease, because there’ll be less available area for thermal contact. The key point is whether or not power consumption stays the same as you scale down.
Here is Jacob’s description of what happens to power consumption in GPUs as you scale down:
Current 2021 gpus have a power density approaching 10⁶ W/m², which severely constrains the design to that of a thin 2D surface...
This in turn constrains off-chip memory bandwidth to scale poorly: shrinking feature sizes with Moore’s Law by a factor of D increases transistor density by a factor of D², but at best only increases 2d off-chip wire density by a factor of only D, and doesn’t directly help reduce wire energy cost at all.
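Reading that claim numerically (my arithmetic, not Jacob's): for a shrink factor of D = 2,
$$\text{transistor density} \times 4,\qquad \text{off-chip wire density} \times 2 \;\Rightarrow\; \text{off-chip bandwidth per transistor} \times \tfrac{1}{2},$$
with the energy cost per bit per unit of wire length unchanged.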
And here is John’s model, where he clearly and crucially disagrees with Jacob on whether scaling down reduces power consumption by shortening wires (the relevant text is the wire energy claim in Jacob’s quote above and the wire-length point in John’s quote below).
If we scale down by a factor of 2, the power consumption is halved (since every wire is half as long), the area is quartered (so power density over the surface is doubled), and the temperature gradient is doubled since the surface is half as thick.
So in fact scaling down is plausibly free, for purposes of heat management...
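A minimal sketch of that bookkeeping (the halving/quartering/doubling factors are from John's quote; the function name and unit normalizations are mine):

```python
# Toy rendering of John's scale-down argument: shrink every linear dimension by s,
# hold system and coolant temperatures fixed, and compare power to heat removal.

def scale_down(s, base_power=1.0, base_area=1.0, base_thickness=1.0, delta_T=1.0):
    power = base_power * s            # every wire is s times as long, so wire power scales with s
    area = base_area * s ** 2         # contact area scales with s^2
    thickness = base_thickness * s    # the conducting surface is s times as thick
    # conductive heat flow ~ (contact area) * (temperature difference) / (thickness)
    heat_removal = area * delta_T / thickness
    return power, heat_removal

for s in (1.0, 0.5, 0.25):
    power, heat_removal = scale_down(s)
    print(f"s={s:.2f}  power={power:.2f}  heat removal={heat_removal:.2f}  ratio={power / heat_removal:.2f}")
```

The power-to-heat-removal ratio stays constant as s shrinks, which is the sense in which scaling down is "plausibly free" in John's model.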
John also speaks to our ability to upgrade the cooling system:
On top of that, we could of course just use a colder environment, i.e. pump liquid nitrogen or even liquid helium over the thing.
Jacob doesn’t really talk about the limits of our ability to cool GPUs by upgrading the cooling system in this section, talking only of the thin 2D design of GPUs being motivated by a need to achieve “massive cooling through large heatsinks and fans.” Ctrl+F does not find the words “nitrogen” and “helium” in his post, and only the version of John’s comment in DaemonicSigil’s rebuttal to Jacob contains those terms. I am not sure if Jacob has expanded on his thoughts on the limits of higher-performance cooling elsewhere in his many comment replies.
So as far as I can tell, this is where the chain of claims and counter-claims is parked for now: a disagreement over power consumption changes as wires are shortened, and a disagreement on how practical it is for better cooling to allow further miniaturization even if scaling down does result in decreased heat flows and thus higher temperatures inside of the GPU. I expect there might be disagreement over whether scaling down will permit thinning of the surface (as John tentatively proposes).
Note that I am not an expert on these specific topics, although I have a biomedical engineering MS—my contribution here is gathering relevant quotes and attempting to show how they relate to each other in a way that’s more convenient than bouncing back and forth between posts. If I have made mistakes, please correct me and I will update this comment. If it’s fundamentally wrong, rather than having a couple local errors, I’ll probably just delete it as I don’t want to add noise to the discussion.
Strongly upvoted for taking the effort to sum up the debate between these two.
Just a brief comment from me, this part:
If we scale down by a factor of 2, the power consumption is halved (since every wire is half as long), the area is quartered (so power density over the surface is doubled), and the temperature gradient is doubled since the surface is half as thick.
Only makes sense in the context of a specified temperature range and wire material. I’m not sure if it was specified elsewhere or not.
A trivial example: a superconducting wire at 50 K will certainly not have its power consumption halved by scaling down by a factor of 2, since its consumption is already practically zero (though not perfectly zero).
This is all assuming that the power consumption for a wire is at-or-near the Landauer-based limit Jacob argued in his post.
Thank you for this effort. I will probably end up allocating a share of the prize money for effortposts like these too.
Thank you for the effort in organizing this conversation. I want to clarify a few points.
Around the very beginning of the density & temperature section I wrote:
but wire volume requirements scale linearly with dimension. So if we ignore all the machinery required for cellular maintenance and cooling, this indicates the brain is at most about 100x larger than strictly necessary (in radius), and more likely only 10x larger.
However, even though the wiring energy scales linearly with radius, the surface area power density which crucially determines temperature scales with the inverse squared radius, and the minimal energy requirements for synaptic computation are radius invariant.
Radius there refers to brain radius, not wire radius. Unfortunately there are two meanings of wiring energy or wire energy. By ‘wiring energy’ above hopefully the context helps make clear that I meant the total energy used by brain wiring/interconnect, not the ‘wire energy’ in terms of energy per bit per nm, which is more of a fixed constant that depends on wire design tradeoffs.
So my model was/is that if we assume you could just take the brain and keep the same amount of compute (neurons/synapses/etc) but somehow shrink the entire radius by a factor of D, this would decrease total wiring energy by the same factor D by just shortening all the wires in the obvious way.
However, the surface power density scales with radius as 1/R², so the net effect is that surface power density from interconnect scales with 1/R, ie it increases by a factor of D as you shrink by a factor of D, which thereby increases your cooling requirement (in terms of net heat flow) by the same factor D. But since the energy use of synaptic computation does not change, that just quickly dominates, scaling with 1/R² and thus D².
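A minimal numeric sketch of that picture (the 1/D, 1/D², and radius-invariant scalings are from Jacob's comment; the function name and normalizations are mine):

```python
# Toy table of Jacob's claimed scalings for a brain shrunk in radius by a factor D.

def shrunken_brain(D, wiring_power=1.0, synaptic_power=1.0, surface_area=1.0):
    wiring_power_shrunk = wiring_power / D        # total interconnect energy falls as the wires shorten
    synaptic_power_shrunk = synaptic_power        # synaptic computation energy is radius-invariant
    surface_area_shrunk = surface_area / D ** 2   # surface area falls with radius squared
    return (
        wiring_power_shrunk / surface_area_shrunk,    # interconnect surface power density: grows as D
        synaptic_power_shrunk / surface_area_shrunk,  # synaptic surface power density: grows as D^2
    )

for D in (1, 2, 10):
    interconnect_density, synaptic_density = shrunken_brain(D)
    print(f"D={D:2d}  interconnect power density x{interconnect_density:.0f}  synaptic power density x{synaptic_density:.0f}")
```

As D grows, the synaptic term dominates the surface power density, matching the last sentence above.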
In the section you quoted where I say:
This in turn constrains off-chip memory bandwidth to scale poorly: shrinking feature sizes with Moore’s Law by a factor of D increases transistor density by a factor of D², but at best only increases 2d off-chip wire density by a factor of only D, and doesn’t directly help reduce wire energy cost at all.
Now I have moved to talking about 2D microchips, and “wire energy” here means the energy per bit per nm, which again doesn’t scale with device size. Also the D here is scaling in a somewhat different way—it is referring to reducing the size of all devices as in normal Moore’s law shrinkage while holding the total chip size constant, increasing device density.
Looking back at that section I see numerous clarifications I would make now, and I would also perhaps focus more on the surface power density as a function of size, and perhaps analyze cooling requirements. However I think it is reasonably clear from the document that shrinking the brain radius by a factor of X increases the surface power density (and thus cooling requirements in terms of coolant flow at fixed coolant temp) from synaptic computation by X² and from interconnect wiring by X.
In practice digital computers are approaching the limits of miniaturization and tend to be 2D for fast logic chips in part for cooling considerations as I describe. The Cerebras wafer for example represents a monumental engineering advance in terms of getting power into and pumping heat out of a small volume, but they still use a 2D chip design, not 3D, because 2D allows you dramatically more surface area for pumping in power and out heat than a 3D design, at the sacrifice of much worse interconnect geometry scaling in terms of latency and bandwidth.
We can make 3D chips today and do, but that tends to be most viable for memory rather than logic, because memory has far lower power density (and the brain being neuromorphic is more like a giant memory chip with logic sprinkled around right next to each memory unit).