The clock speed of a GPU is indeed meaningful: there is a clock inside the GPU that provides some signal that’s periodic at a frequency of ~ 1 GHz. However, the corresponding period of ~ 1 nanosecond does not correspond to the timescale of any useful computations done by the GPU.
True, but isn’t this almost exactly analogously true for neuron firing speeds? The corresponding period for neurons (10 ms to 1 s) does not generally correspond to the timescale of any useful cognitive work or computation done by the brain.
The human brain is estimated to do the computational equivalent of around 1e15 FLOP/s.
“Computational equivalence” here seems pretty fraught as an analogy, perhaps more so than the clock speed <-> neuron firing speed analogy.
In the context of digital circuits, FLOP/s is a measure of an outward-facing performance characteristic of a system or component: a chip that can do 1 million FLOP/s means that every second it can take 2 million floats as input, perform some arithmetic operation on them (pairwise) and return 1 million results.
(Whether the “arithmetic operations” are FP64 multiplication or FP8 addition will of course have a big effect on the top-line number you can report in your datasheet or marketing material, but a good benchmark suite will give you detailed breakdowns for each type.)
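To make the outward-facing view concrete, here’s a toy sketch in Python (the 1 million FLOP/s figure is just the illustrative number from above, not any real chip’s rating):

```python
# Outward-facing view of FLOP/s: a chip rated at 1 million FLOP/s
# consumes two floats per pairwise operation and emits one result,
# every second.
rated_flops = 1_000_000   # pairwise arithmetic ops per second (illustrative)
inputs_per_op = 2         # each operation takes two float operands
outputs_per_op = 1        # and returns one float result

floats_in_per_sec = rated_flops * inputs_per_op      # 2,000,000 floats in
results_out_per_sec = rated_flops * outputs_per_op   # 1,000,000 results out
print(floats_in_per_sec, results_out_per_sec)
```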
But even the top-line number is (at least theoretically) a very concrete measure of something that you can actually get out of the system. In contrast, when used in “computational equivalence” estimates of the brain, FLOP/s are (somewhat dubiously, IMO) repurposed as a measure of what the system is doing internally.
So even if the 1e15 “computational equivalence” number is right, AND all of that computation is irreducibly a part of the high-level cognitive algorithm that the brain is carrying out, all that means is that it necessarily takes at least 1e15 FLOP/s to run or simulate a brain at neuron-level fidelity. It doesn’t mean that you can’t get the same high-level outputs of that brain through some other much more computationally efficient process.
(Note that “more efficient process” need not be high-level algorithmic improvements that look radically different from the original brain-based computation; the efficiencies could come entirely from low-level optimizations such as not running parts of the simulation that won’t affect the final output, or running them at lower precision, or with caching, etc.)
Separately, I think your sequential tokens per second calculation actually does show that LLMs are already “thinking” (in some sense) several OOM faster than humans? 50 tokens/sec is about 5 lines of code per second, or 18,000 lines of code per hour. Setting aside quality, that’s easily 100x more than the average human developer can usually write (unassisted) in an hour, unless they’re writing something very boilerplate or greenfield.
(The comparison gets even more stark when you consider longer timelines, since an LLM can generate code 24/7 without getting tired: 18,000 lines / hr is ~150 million lines in a year.)
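Sanity-checking that arithmetic in Python (50 tokens/sec and ~10 tokens per line of code are the assumptions above; the ~180 lines/hour human baseline is a rough figure implied by the 100x claim, not a measured statistic):

```python
# BOTEC: LLM code-generation rate vs. an average human developer.
tokens_per_sec = 50        # assumed LLM decoding speed
tokens_per_line = 10       # assumed tokens per line of code

lines_per_sec = tokens_per_sec / tokens_per_line     # 5 lines/sec
lines_per_hour = lines_per_sec * 3600                # 18,000 lines/hour
lines_per_year = lines_per_hour * 24 * 365           # ~158 million lines/year

human_lines_per_hour = 180   # hypothetical unassisted human rate
speedup = lines_per_hour / human_lines_per_hour      # ~100x

print(f"{lines_per_hour:,.0f} lines/hr, "
      f"~{lines_per_year / 1e6:.0f}M lines/yr, "
      f"~{speedup:.0f}x human")
```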
The main issue with current LLMs (which somewhat invalidates this whole comparison) is that they can pretty much only generate boilerplate or greenfield stuff. Generating large volumes of mostly-useless / probably-nonsense boilerplate quickly doesn’t necessarily correspond to “thinking faster” than humans, but that’s mostly because current LLMs are only barely doing anything that can rightfully be called thinking in the first place.
So I agree with you that the claim that current AIs are thinking faster than humans is somewhat fraught. However, I think there are multiple strong reasons to expect that future AIs will think much faster than humans, and the clock speed <-> neuron firing analogy is one of them.
True, but isn’t this almost exactly analogously true for neuron firing speeds? The corresponding period for neurons (10 ms to 1 s) does not generally correspond to the timescale of any useful cognitive work or computation done by the brain.
Yes, which is why you should not be using that metric in the first place.
But even the top-line number is (at least theoretically) a very concrete measure of something that you can actually get out of the system. In contrast, when used in “computational equivalence” estimates of the brain, FLOP/s are (somewhat dubiously, IMO) repurposed as a measure of what the system is doing internally.
Will you still be saying this if future neural networks are running on specialized hardware that, much like the brain, can only execute forward or backward passes of a particular network architecture? I think talking about FLOP/s in this setting makes a lot of sense, because we know the capabilities of neural networks are closely linked to how much training and inference compute they use, but maybe you see some problem with this also?
So even if the 1e15 “computational equivalence” number is right, AND all of that computation is irreducibly a part of the high-level cognitive algorithm that the brain is carrying out, all that means is that it necessarily takes at least 1e15 FLOP/s to run or simulate a brain at neuron-level fidelity. It doesn’t mean that you can’t get the same high-level outputs of that brain through some other much more computationally efficient process.
I agree, but even if we think future software progress will enable us to get a GPT-4 level model with 10x smaller inference compute, it still makes sense to care about what inference with GPT-4 costs today. The same is true of the brain.
Separately, I think your sequential tokens per second calculation actually does show that LLMs are already “thinking” (in some sense) several OOM faster than humans? 50 tokens/sec is about 5 lines of code per second, or 18,000 lines of code per hour. Setting aside quality, that’s easily 100x more than the average human developer can usually write (unassisted) in an hour, unless they’re writing something very boilerplate or greenfield.
Yes, but they are not thinking 7 OOM faster. My claim is not that AIs can’t think faster than humans; indeed, I think they can. However, current AIs are not thinking faster than humans when you take into account the “quality” of the thinking as well as the rate at which it happens, which is why I think FLOP/s is a more useful measure here than token latency. GPT-4 has higher token latency than GPT-3.5, but I think it’s fair to say that GPT-4 is the model that “thinks faster” when asked to accomplish some nontrivial cognitive task.
The main issue with current LLMs (which somewhat invalidates this whole comparison) is that they can pretty much only generate boilerplate or greenfield stuff. Generating large volumes of mostly-useless / probably-nonsense boilerplate quickly doesn’t necessarily correspond to “thinking faster” than humans, but that’s mostly because current LLMs are only barely doing anything that can rightfully be called thinking in the first place.
Exactly, and the empirical trend is that there is a quality-token latency tradeoff: if you want to generate tokens at random, it’s very easy to do that at extremely high speed. As you increase your demands on the quality you want these tokens to have, you must take more time per token to generate them. So it’s not fair to compare a model like GPT-4 to the human brain on grounds of “token latency”: I maintain that throughput comparisons (training compute and inference compute) are going to be more informative in general, though software differences between ML models and the brain can still make it not straightforward to interpret those comparisons.
True, but isn’t this almost exactly analogously true for neuron firing speeds? The corresponding period for neurons (10 ms to 1 s) does not generally correspond to the timescale of any useful cognitive work or computation done by the brain.
Yes, which is why you should not be using that metric in the first place.
Well, clock speed is a pretty fundamental parameter in digital circuit design. For a fixed circuit, running it at a 1000x slower clock frequency means an exactly 1000x slowdown. (Real integrated circuits are usually designed to operate in a specific clock frequency range that’s not that wide, but in theory you could scale any chip design running at 1 GHz to run at 1 kHz or even lower pretty easily, on a much lower power budget.)
Clock speeds between different chips aren’t directly comparable, since architecture and various kinds of parallelism matter too, but it’s still a good indicator of what kind of regime you’re in, e.g. a high-powered / actively-cooled datacenter chip vs. an ultra low power embedded microcontroller.
Another way of looking at it is power density: below ~5 GHz or so (where integrated circuits start to run into fundamental physical limits), there’s a pretty direct tradeoff between power consumption and clock speed.
A modern high-end IC (e.g. a desktop CPU) has a power density on the order of 100 W / cm^2. This is over a tiny thickness; assuming 1 mm you get a 3-D power dissipation of 1000 W / cm^3 for a CPU, vs. human brains that dissipate ~10 W / 1000 cm^3 = 0.01 W / cm^3.
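Spelling the same BOTEC out in Python (100 W/cm^2 over an assumed 1 mm of active silicon, and ~10 W over ~1000 cm^3 for the brain, are the rough figures above):

```python
# BOTEC: volumetric power density of a high-end IC vs. the human brain.
ic_areal_w_per_cm2 = 100    # ~100 W/cm^2 for a modern desktop CPU
ic_thickness_cm = 0.1       # assume ~1 mm of active silicon

ic_w_per_cm3 = ic_areal_w_per_cm2 / ic_thickness_cm    # 1000 W/cm^3
brain_w_per_cm3 = 10 / 1000                            # 0.01 W/cm^3

headroom = ic_w_per_cm3 / brain_w_per_cm3              # ~1e5, i.e. ~5 OOM
print(f"IC: {ic_w_per_cm3:g} W/cm^3, brain: {brain_w_per_cm3:g} W/cm^3, "
      f"ratio: {headroom:.0e}")
```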
The point of this BOTEC is that there are several orders of magnitude of “headroom” available to run whatever the computation the brain is performing at a much higher power density, which, all else being equal, usually implies a massive serial speed up (because the way you take advantage of higher power densities in IC design is usually by simply cranking up the clock speed, at least until that starts to cause issues and you have to resort to other tricks like parallelism and speculative execution).
The fact that ICs are bumping into fundamental physical limits on clock speed suggests that they are already much closer to the theoretical maximum power densities permitted by physics, at least for silicon-based computing. This further implies that, if and when someone does figure out how to run the actual brain computations that matter in silicon, they will be able to run those computations at many OOM higher power densities (and thus OOM higher serial speeds, by default) pretty easily, since biological brains are very very far from any kind of fundamental limit on power density. I think the clock speed <-> neuron firing speed analogy is a good way of summarizing this whole chain of inference.
Will you still be saying this if future neural networks are running on specialized hardware that, much like the brain, can only execute forward or backward passes of a particular network architecture? I think talking about FLOP/s in this setting makes a lot of sense, because we know the capabilities of neural networks are closely linked to how much training and inference compute they use, but maybe you see some problem with this also?
I think energy and power consumption are the safest and most rigorous way to compare and bound the amount of computation that AIs are doing vs. humans. (This unfortunately implies a pretty strict upper bound, since we have several billion existence proofs that ~20 W is more than sufficient for lethally powerful cognition at runtime, at least once you’ve invested enough energy in the training process.)
current AIs are not thinking faster than humans [...] GPT-4 has higher token latency than GPT-3.5, but I think it’s fair to say that GPT-4 is the model that “thinks faster”
This notion of thinking speed depends on the difficulty of a task. If one of the systems can’t solve a problem at all, it’s neither faster nor slower. If both systems can solve a problem, we can compare the time they take. In that sense, current LLMs are 1-2 OOMs faster than humans at the tasks both can solve, and much cheaper.
Old chess AIs were slower than humans who were good at chess. If future AIs can take advantage of search to improve quality, they might again get slower than humans at sufficiently difficult tasks, while simultaneously being faster than humans at easier tasks.
Sure, but in that case I would not say the AI thinks faster than humans, I would say the AI is faster than humans at a specific range of tasks where the AI can do those tasks in a “reasonable” amount of time.
As I’ve said elsewhere, there is a quality or breadth vs serial speed tradeoff in ML systems: a system that only does one narrow and simple task can do that task at a high serial speed, but as you make systems more general and get them to handle more complex tasks, serial speed tends to fall. The same logic that people are using to claim GPT-4 thinks faster than humans should also lead them to think a calculator thinks faster than GPT-4, which is an unproductive way to use the one-dimensional abstraction of “thinking faster vs. slower”.
You might ask “Well, why use that abstraction at all? Why not talk about how fast the AIs can do specific tasks instead of trying to come up with some general notion of if their thinking is faster or slower?” I think a big reason is that people typically claim the faster “cognitive speed” of AIs can have impacts such as “accelerating the pace of history”, and I’m trying to argue that the case for such an effect is not as trivial to make as some people seem to think.
This notion of thinking speed makes sense for large classes of tasks, not just specific tasks. And a natural class of tasks to focus on is the harder tasks among all the tasks both systems can solve.
So in this sense a calculator is indeed much faster than GPT-4, and GPT-4 is 2 OOMs faster than humans. An autonomous research AGI is capable of autonomous research, so its speed can be compared to humans at that class of tasks.
AI accelerates the pace of history only when it’s capable of making the same kind of progress as humans in advancing history, at which point we need to compare their speed to that of humans at that activity (class of tasks). Currently AIs are not capable of that at all. If hypothetically 1e28 training FLOPs LLMs become capable of autonomous research (with scaffolding that doesn’t incur too much latency overhead), we can expect that they’ll be 1-2 OOMs faster than humans, because we know how they work. Thus it makes sense to claim that 1e28 FLOPs LLMs will accelerate history if they can do research autonomously. If AIs need to rely on extensive search on top of LLMs to get there, or if they can’t do it at all, we can instead predict that they don’t accelerate history, again based on what we know of how they work.