Epistemics: I’m an ML platform engineer who works on inference accelerators.
What you said is correct; however, look at how the brain does it.
There are roughly 1,000 physical, living synaptic ‘wires’ running to each target cell. Electrical action potentials travel down these ‘wires’, jumping from node to node, until they reach a synapse, where the sending cell releases a quantity of neurotransmitter across a gap to the receiving cell.
Each neurotransmitter causes a positive or negative voltage change at the receiver.
So you can abstract this as ([1] * <± delta from neurotransmitter>) + (receiving cell voltage).
The receiving cell then applies an activation function that integrates many of these inputs, ~1,000 on average, and emits an output pulse if a voltage threshold is reached.
This is both an analog and a digital system, and it’s a common misconception to think it has infinite precision because it does temporal integration. It does not, and therefore a finite-precision binary computer can model this with no loss of performance.
We can measure the ‘clock speed’ of this method at about 1 kilohertz.
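A minimal sketch of this abstraction, as an integrate-and-fire style update; the threshold and delta values below are illustrative assumptions, not measured figures:

```python
import numpy as np

def neuron_step(v_membrane, deltas, threshold=1.0, v_reset=0.0):
    """One ~1 ms 'clock tick' of the abstraction above.

    deltas: per-synapse +/- voltage changes (~1000 of them) caused by
    neurotransmitter release during this tick.
    Returns (new membrane voltage, fired?).
    """
    v_membrane = v_membrane + np.sum(deltas)   # ([1] * <+- delta>) + (cell voltage)
    if v_membrane >= threshold:                # threshold-based activation
        return v_reset, True                   # emit an output pulse and reset
    return v_membrane, False

# ~1000 synapses, each contributing a small +/- delta this tick (made-up values)
rng = np.random.default_rng(0)
deltas = rng.choice([-0.002, 0.003], size=1000)
v, fired = neuron_step(0.0, deltas)
```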
Middle Summary: using dedicated hardware for every circuit, the brain runs at 1 kilohertz.
A similar approach is to use dedicated hardware to model a brain-like AI system. Today you would use a dedicated multiply-accumulate (MAC) unit for every connection in the neural network graph, i.e. for a 175B-parameter model at 16-bit weights, you would use 175 billion 16-bit MACs. You would also have dedicated circuitry for each activation calculation, all running in parallel.
If the clock speed of the H100 is 1755 MHz, it can do (1,979 teraFLOPS)/2 MACs per second, and each MAC completes in a single clock, then it has 563,818 MAC units organized into its tensor cores.
So to run GPT-3.5 with dedicated hardware you would need 310,384 times more silicon area.
A true human-level intelligence likely requires many more weights.
Summary: you can build dedicated hardware that will run AI models substantially faster than the brain, roughly 1-2 million times faster if you run at H100 clock speeds and assume the brain runs at 1-2 kHz. However, it will take up a lot more silicon: given the rumor of 128 H100s per GPT-4 instance, needing 310,384 times more silicon to reach the above speeds means you still need about 2,424 times as much hardware as is currently being used, and it’s all dedicated ASICs.
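Here is the same arithmetic written out as a quick back-of-the-envelope script, using the figures quoted above (1,979 teraFLOPS, 1755 MHz, 175B weights, the rumored 128 H100s per GPT-4 instance):

```python
# Back-of-the-envelope script for the numbers above (figures as quoted there).
h100_clock_hz   = 1755e6                    # boost clock
h100_flops      = 1979e12                   # 16-bit tensor FLOPS; 2 FLOPs per MAC
h100_macs_per_s = h100_flops / 2
mac_units       = h100_macs_per_s / h100_clock_hz   # ~563,818 single-clock MACs

weights    = 175e9                          # "GPT-3.5"-sized model, 1 MAC per weight
area_ratio = weights / mac_units            # ~310,384x more silicon area

brain_hz = 1e3                              # ~1 kHz
speedup  = h100_clock_hz / brain_hz         # ~1.76 million x

gpt4_h100s     = 128                        # rumored H100s per GPT-4 instance
extra_hw_ratio = area_ratio / gpt4_h100s    # ~2,425x more hardware than today

print(f"{mac_units:,.0f} MACs/H100, {area_ratio:,.0f}x area, "
      f"{speedup:,.0f}x faster than brain, {extra_hw_ratio:,.0f}x more hardware")
```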
For AI ‘risk’ arguments: if this is what it requires to run at these speeds, there are 3 takeaways:
1. AI models that “escape” will, unless there are large pools of the above hardware (true neural processors) sitting in vulnerable places on the internet, run so slowly as to be effectively dead. (Because the “military” AIs working with humans will have this kind of hardware to hunt them down with.)
2. It is important to monitor the production and deployment of hardware capable of running AI models millions of times faster.
3. It is physically possible for humans to build hardware to host AI that runs many times faster, peaking at ~1 million times faster, using 2024 fabrication technology. It just won’t be cheap. The hardware would be many ICs that look like the below:
Importantly, by “faster” here we are talking about latency. The ops per second possible on a Cerebras-style chip will not be drastically higher than for a similar area of GPU silicon; it’s conceivable you could get 100x higher ops/sec due to locality, if I remember my numbers right. But with drastically lower latency, the model only has to decide which part of itself to run at full frequency, so it would in fact be able to think drastically faster than humans.
Though you can also do a similar trick on GPUs: it wouldn’t be too hard to design an architecture with a small, very-high-frequency portion of the network, since there’s a recently released block design that is already implemented in a way that could pull it off.
Don’t global clock speeds have to go down as die area goes up due to the speed of light constraint?
For instance, if you made a die with 1e15 MAC units and the area scaled linearly, you would be looking at a die that’s ~ 2e9 times larger than H100’s die size, which is about 1000 mm^2. The physical dimensions of such a die would be around 2 km^2, so the speed of light would limit global clock frequencies to something on the order of c/(1 km) ~= 300 kHz, which is not 1 million times faster than the 1 kHz you attribute to the human brain. If you need multiple round trips for a single clock, the frequencies will get even lower.
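Roughly, the arithmetic behind this estimate, order of magnitude only; the 1e15 MAC count and the single-traverse clock assumption are taken from the paragraph above:

```python
# Rough check of the arithmetic in this estimate (order of magnitude only).
mac_units_needed = 1e15
mac_units_h100   = 563_818               # from the parent comment's estimate
h100_die_mm2     = 1000

scale   = mac_units_needed / mac_units_h100   # ~1.8e9x more MAC units
die_m2  = scale * h100_die_mm2 * 1e-6         # ~1.8e6 m^2, i.e. ~2 km^2

c          = 3e8                              # m/s
die_side_m = die_m2 ** 0.5                    # ~1.3 km on a side
f_one_way  = c / die_side_m                   # one traverse per clock -> ~2e5 Hz
# Multiple round trips per clock would push this even lower.
print(f"~{die_m2 / 1e6:.1f} km^2 of die, global clock limited to ~{f_one_way/1e3:.0f} kHz")
```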
Maybe when the clock frequencies get this low, you’re dissipating so little heat that you can go 3D without worrying too much about heating issues and that buys you something. Still, your argument here doesn’t seem that obvious to me, especially if you consider the fact that one round trip for one clock is extremely optimistic if you’re trying to do all MACs at once. Remember that GPT-3 is a sequential model; you can’t perform all the ops in one clock because later layers need to know what the earlier layers have computed.
Overall I think your comment here is quite speculative. It may or may not be true, I think we’ll see, but people shouldn’t treat it as if this is obviously something that’s feasible to do.
Don’t global clock speeds have to go down as die area goes up due to the speed of light constraint?
Yes if you use 1 die with 1 clock domain, they would. Modern chips don’t.
For instance, if you made a die with 1e15 MAC units and the area scaled linearly, you would be looking at a die that’s ~ 2e9 times larger than H100’s die size, which is about 1000 mm^2. The physical dimensions of such a die would be around 2 km^2, so the speed of light would limit global clock frequencies to something on the order of c/(1 km) ~= 300 kHz, which is not 1 million times faster than the 1 kHz you attribute to the human brain.
So actually you would make something like 10,000 dies, with the MAC units spread between them. Clusters of dies calculate a single layer of the network, and there are many optical interconnects between the modules, using silicon photonics, like this:
I was focused on the actual individual synapse part of it. The key thing to realize is that the MAC is the slowest step: multiplications are complicated and take a lot of silicon, and I am implicitly assuming all other steps are cheap and fast. Note that it’s also a pipelined system with enough hardware that there aren’t any stalls: the moment a MAC finishes, it’s processing the next input on the next clock, and so on. It’s what hardware engineers would do if they had an unlimited budget for silicon.
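A toy model of that fully pipelined assumption; the layer size and pipeline depth below are hypothetical, and one dedicated MAC per weight with no stalls is assumed:

```python
# Toy numbers for the fully pipelined, dedicated-MAC assumption above
# (hypothetical layer size and pipeline depth, no stalls).
clock_hz       = 1.755e9          # run the ASIC at H100-like clocks
macs_per_layer = 1e9              # illustrative layer size, one MAC per weight
pipeline_depth = 10               # clocks from a layer's input to its output

latency_s  = pipeline_depth / clock_hz     # ~5.7 ns per layer
throughput = macs_per_layer * clock_hz     # one result per MAC per clock
print(f"layer latency ~{latency_s * 1e9:.1f} ns, sustained ~{throughput:.2e} MACs/s")
```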
Maybe when the clock frequencies get this low, you’re dissipating so little heat that you can go 3D without worrying too much about heating issues and that buys you something. Still, your argument here doesn’t seem that obvious to me, especially if you consider the fact that one round trip for one clock is extremely optimistic if you’re trying to do all MACs at once. Remember that GPT-3 is a sequential model; you can’t perform all the ops in one clock because later layers need to know what the earlier layers have computed.
No no no. Each layer runs 1 million times faster. If a human brain needs 10 seconds to think of something, that’s enough time for the AI equivalent to compute the activations of 10 million layers, or many forward passes over the information, reflection on the updated memory buffer from the prior steps, and so on.
Or the easier way to view it is: say a human responds to an event in 300 ms.
The optic nerve conducts at about 13 m/s, with an average length of 45 mm, so it takes about 3.46 milliseconds for a signal to reach the visual cortex.
Then for a signal to reach a limb, about 100 milliseconds. So call it roughly 187 milliseconds for neural network layers.
Assume the synapses run at 1 kHz total, so in theory there can be 187 layers of neural network between the visual cortex and the output nerves going to a limb.
So to run this 1 million times faster, we use hollow-core optical fibers, and signals need to travel 1 meter total of cabling. That will take about 3.35 nanoseconds, which is roughly 89 million times faster than the human’s ~300 ms loop.
OK, can we evaluate a 187-layer neural network 1 million times faster than 187 milliseconds?
Well, if we take 1 clock for the MACs and compute the activations with 9 clocks of delay (including clock-domain syncs), you need 10 GHz for a millionfold speedup.
But oh wait, we have better wiring, so nearly the full ~300 ms of the original budget is available for the 187 neural network layers (about 300 ns at the millionfold target).
This means we only need about 6.25 GHz, or we may be able to do a layer in fewer than 10 clocks.
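A minimal sketch of this latency budget, using the figures above (1 m of hollow-core fiber, 187 layers, 10 clocks per layer); the result lands close to the ~6.25 GHz quoted:

```python
# Worked version of the latency-budget estimate above (figures as quoted there).
c_fiber   = 3.0e8                  # hollow-core fiber: signals travel at ~c
wire_time = 1.0 / c_fiber          # 1 m of cabling -> ~3.3 ns

human_response_s = 0.300           # human reacts in ~300 ms
layers           = 187             # ~1 layer per 1 ms brain 'clock'
clocks_per_layer = 10              # 1 MAC clock + ~9 clocks for activation / domain sync

target_s  = human_response_s / 1e6           # millionfold speedup -> ~300 ns total
compute_s = target_s - wire_time             # ~297 ns left for the layers
f_needed  = layers * clocks_per_layer / compute_s
print(f"wiring: {wire_time*1e9:.2f} ns, clock needed: ~{f_needed/1e9:.1f} GHz")
```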
Overall I think your comment here is quite speculative. It may or may not be true, I think we’ll see, but people shouldn’t treat it as if this is obviously something that’s feasible to do.
There is nothing speculative or novel here about the overall idea of using ASICs to accelerate compute tasks. The reason it won’t be done this way is that you can extract a lot more economic value from your (silicon + power consumption) with slower systems that share hardware.
I mean, right now there’s no value at all in doing this: at current speeds the AI model is just going to go off the rails and be deeply wrong within a few minutes, or get stuck in a loop it doesn’t know it’s in because of short context. Faster serial speed isn’t helpful.
In the future, you have to realize that even if the brain of some robotic machine can respond 1 million times faster, physics does not allow the hardware to move that fast, or the cameras to even register a new frame by accumulating photons that fast. It’s not a useful speed. The same applies to R&D tasks: at much lower speedups the task will be limited by the physical world.
No, clock speed stays the same, but the clock-cycle latency of communication between regions increases, just as CPUs require more clock cycles to access memory than they used to.
(Because the “military” AIs working with humans will have this kind of hardware to hunt them down with)
You need to make a lot of extra assumptions about the world for this reasoning to work.
These “military” AIs need to exist. And they need to be reasonably loosely chained. If their safety rules are so strict they can’t do anything, then they can’t do anything, however fast they don’t do it. They need to be actively trying to do their job, as opposed to playing along for the humans but not really caring. They need to be smart enough. If the escaped AI uses some trick that the “military” AI just can’t comprehend, then it fails to comprehend it again and again, very fast.
Of course. Implicitly I am assuming that intelligent algorithms have diminishing returns on architectural complexity. So if the “military”-owned model is simple, with an architecture + training suite controlled and understood by humans, the assumption is that the “free” model cannot effectively be that much more intelligent via a better architecture if it lacks compute by a factor of millions; that greater intelligence is a scale-dependent phenomenon.
This is so far consistent with empirical evidence. Do you happen to know of evidence to challenge this assumption? As far as I know from LLM experiments, there are tweaks to architecture, but the main determinant of benchmark performance is model+data scale (which are interdependent), and non-transformer architectures seem to show similar emergent properties.
As far as I know from LLM experiments, there are tweaks to architecture, but the main determinant of benchmark performance is model+data scale (which are interdependent), and non-transformer architectures seem to show similar emergent properties.
So within the rather limited subspace of LLM architectures, all architectures are about the same.
I.e., once you ignore the huge space of architectures that just ignore the data and squander compute, then architecture doesn’t matter. I.e., we have one broad family of techniques (with gradient descent, text prediction, etc.), and anything in that family is about equally good, and anything outside basically doesn’t work at all.
This looks to me to be fairly strong evidence that you can’t get a large improvement in performance by randomly bumbling around with small architecture tweaks to existing models.
Does this say anything about whether a fundamentally different approach might do better? No. We can’t tell that from this evidence. Although, looking at the human brain, we can see it seems to be more data-efficient than LLMs. And we know that in theory models could be Much more data-efficient. Addition is very simple; Solomonoff induction would have it as a major hypothesis after seeing only a couple of examples. But GPT-2 saw loads of arithmetic in training, and still couldn’t reliably do it.
So I think LLM architectures form a flat-bottomed local semi-minimum (minimal in at least most dimensions). It’s hard to get big improvements just by tweaking it (we are applying enough grad student descent to ensure that), but it’s nowhere near the global optimum.
Suppose everything is really data-bottlenecked, and the slower AI has a more data-efficient algorithm. Or maybe the slower AI knows how to make synthetic data, and the human-trained AI doesn’t.
There’s one interesting technical detail. The human brain uses heavy irregular sparsity: most possible connections between layers have no connection at all (zero weight). On a GPU, there is limited performance improvement from sparsity, because the hardware subunits can only exploit sparse computations if the memory access patterns are regular. Irregular sparsity doesn’t have hardware support.
Future neural processors will support full sparsity. This will probably allow them to run 100x faster with the same amount of silicon (far fewer MACs, but you need layer-activation multicast units), but only on new specialized hardware.
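To illustrate what irregular (unstructured) sparsity means at the bookkeeping level; this is only a software sketch with an arbitrary 1% density, not a claim about any particular hardware:

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Irregular (unstructured) sparsity: most weights are exactly zero, with no
# regular pattern. Dense hardware still spends a MAC on every one of those zeros.
rng = np.random.default_rng(0)
W = sparse_random(4096, 4096, density=0.01, format="csr", random_state=0)
x = rng.standard_normal(4096)

y = W @ x                                  # only the ~1% nonzero weights matter
dense_macs  = 4096 * 4096
sparse_macs = W.nnz
print(f"dense: {dense_macs:,} MACs, sparse: {sparse_macs:,} "
      f"(~{dense_macs / sparse_macs:.0f}x fewer)")
```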
It’s possible that whatever neural architectures are found using recursion (ones that leave the state space of LLMs or human brains) will have similar compute requirements: that they will be functionally useless on current hardware, while running thousands of times faster on purpose-built hardware.
Same idea though. I don’t see why “the military” can’t do recursion using their own AIs and use custom hardware to outcompete any “rogues”.
Also it provides a simple way to keep control of AI: track the location of custom hardware, know your customer, etc.
I suspect that if AI is anything like computer graphics, there will be at least 5-10 paradigm shifts to new architectures that need updated hardware to run, obsoleting everything deployed, before settling on something that is optimal. FLOPs are not actually fungible, and Turing-complete doesn’t mean your training run will complete this century.
Same idea though. I don’t see why “the military” can’t do recursion using their own AIs and use custom hardware to outcompete any “rogues”.
One of the deep fundamental reasons here is alignment failures. Either the “military” isn’t trying very hard, or humans know they haven’t solved alignment. If humans know they haven’t solved it, they know they can’t build a functional “military” AI; all they can do is make another rogue AI. Or the humans don’t know that, and the military AI is another rogue AI.
For this military AI to be fighting other AIs on behalf of humans, a lot of alignment work has to go right.
The second deep reason is that recursive self-improvement is a strong positive feedback loop. It isn’t clear how strong, but it could be Very strong. So suppose the first AI undergoes a recursive improvement FOOM. And it happens that the rogue AI gets there before any military one. Perhaps because the creators of the military AI are taking their time to check the alignment theory.
Positive feedback loops tend to amplify small differences.
Also, about all those hardware differences. A smart AI might well come up with a design that efficiently uses old hardware. Oh, and this is all playing out in the future, not now. Maybe the custom AI hardware is everywhere by the time this is happening.
I suspect that if AI is anything like computer graphics, there will be at least 5-10 paradigm shifts to new architectures that need updated hardware to run, obsoleting everything deployed, before settling on something that is optimal. FLOPs are not actually fungible, and Turing-complete doesn’t mean your training run will complete this century.
This is with humans doing the research. Humans invent new algorithms more slowly than new chips are made. So it makes sense to adjust the algorithm to the chip. If the AI can do software research far faster than any human, adjusting the software to the hardware (an approach that humans use a lot throughout most of computing) becomes an even better idea.
This is with humans doing the research. Humans invent new algorithms more slowly than new chips are made. So it makes sense to adjust the algorithm to the chip. If the AI can do software research far faster than any human, adjusting the software to the hardware (an approach that humans use a lot throughout most of computing) becomes an even better idea.
Note that if you are seeking an improved network architecture and you need it to work on a limited family of chips, this is constraining your search. You may not be able to find a meaningful improvement over the SOTA with that constraint in place, regardless of your intelligence level. Something like sparsity, in-memory compute, or a neural hardware architecture (where the chip is structured like the network it is modeling, with dedicated hardware) can offer colossal speedups.
this is constraining your search. You may not be able to find a meaningful improvement over the SOTA with that constraint in place, regardless of your intelligence level.
I mean the space of algorithms that can run on an existing chip is pretty huge. Yes it is a constraint. And it’s theoretically possible that the search could return no solutions, if the SOTA was achieved with Much better chips, or was near optimal already, or the agent doing the search wasn’t much smarter than us.
For example, there are techniques that decompose a matrix into its largest eigenvectors, which work great without needing sparse hardware.
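One reading of that is a truncated eigendecomposition (low-rank approximation): keep only the largest-magnitude eigenvectors of a symmetric weight matrix, so applying it becomes dense, regular matrix math. A minimal numpy sketch under that assumption, with arbitrary sizes:

```python
import numpy as np

# Keep only the k largest-magnitude eigenvectors of a symmetric weight matrix,
# turning one big dense multiply into two skinny dense multiplies with
# regular memory access, no sparse hardware needed.
rng = np.random.default_rng(0)
A = rng.standard_normal((512, 512))
W = (A + A.T) / 2                          # symmetric, so eigenvectors are orthogonal

vals, vecs = np.linalg.eigh(W)             # eigenvalues in ascending order
k   = 64
idx = np.argsort(np.abs(vals))[-k:]        # indices of the k largest-magnitude ones

x = rng.standard_normal(512)
y_full    = W @ x
y_lowrank = (vecs[:, idx] * vals[idx]) @ (vecs[:, idx].T @ x)   # rank-k approximation
rel_err   = np.linalg.norm(y_full - y_lowrank) / np.linalg.norm(y_full)
print(f"rank-{k} approximation, relative error {rel_err:.2f}")
```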
No, clock speed stays the same, but the clock-cycle latency of communication between regions increases, just as CPUs require more clock cycles to access memory than they used to.
Sure, but from the point of view of per-token latency that’s going to be a similar effect, no?