But—regardless of Yudkowsky’s current position—it still remains that you’d have been extremely surprised by the last decade’s use of compute if you had believed him, and much less surprised if you had believed Hanson.
I think you are pointing towards something real here, but also, algorithmic progress is currently outpacing compute growth by quite a bit, at least according to the Epoch AI estimates I remember. I also expect algorithmic progress to increase in importance.
I do think that some of the deep learning revolution turned out to be kind of compute bottlenecked, but I don’t believe this is currently that true anymore, though I think it’s kind of messy (since it’s unclear what fraction of compute-optimizations themselves were bottlenecked on making it cheaper to experiment by having cheaper compute).
I had kind of the exact opposite impression of compute bottlenecks (that deep learning was not meaningfully compute bottlenecked until very recently). OpenAI apparently has a bunch of products, and probably also experiments, that are literally just waiting for H100s to arrive. Probably this is mainly due to the massive demand for inference, but still, this seems like an actual hardware bottleneck that is pretty new for the field of DL. It kind of has a parallel to Bitcoin mining technology, where the ability to get the latest-gen ASICs first was (still is?) a big factor in miner profitability.
Huh, maybe. My current guess is that things aren’t really “compute bottlenecked”. It’s just the case that we now have profitable enough AI that we really want to have better compute. But if we didn’t get cheaper compute, we would still see performance increase a lot as we find ways to improve compute-efficiency, the same way we’ve been improving it over the past 5-10 years; and for any given period of time, algorithmic progress is a bigger deal for increasing performance than the degree to which compute got cheaper in that same period.
I’d say usually bottlenecks aren’t absolute, but instead quantifiable and flexible based on costs, time, etc.?
One could say that we’ve reached the threshold where we’re bottlenecked on inference-compute, whereas previously talk of compute bottlenecks was about training-compute.
This seems to matter for some FOOM scenarios since e.g. it limits the FOOM that can be achieved by self-duplicating.
But the fact that AI companies are trying their hardest to scale up compute, and are also actively researching more compute-efficient algorithms, means IMO that the inference-compute bottleneck will be short-lived.
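A toy calculation makes the self-duplication point concrete. This is a minimal sketch with made-up placeholder numbers (not estimates from anyone in this thread): the ceiling on concurrent copies is just available inference compute divided by per-copy compute requirements, so the cap moves one-for-one with compute-efficiency gains.

```python
# Toy illustration of why an inference-compute bottleneck caps "FOOM by
# self-duplication". All numbers are made-up placeholders, not estimates.

available_flops = 1e21       # hypothetical total inference FLOP/s available
flops_per_copy = 1e15        # hypothetical FLOP/s to run one copy in real time

max_copies = available_flops / flops_per_copy
print(f"max concurrent copies: ~{max_copies:,.0f}")          # ~1,000,000 here

# Halving the per-copy cost (compute-efficiency progress) doubles the cap,
# which is one reason to expect the bottleneck to be short-lived.
print(f"after a 2x efficiency gain: ~{available_flops / (flops_per_copy / 2):,.0f}")
```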
In what sense are they “not trying their hardest”?
I think you inserted an extra “not”.
Oh gosh, how did I hallucinate that?
Maybe you’re an LLM.
This is true, but as a picture of the past, this is underselling compute by focusing on the cost of compute rather than compute itself.
I.e., in the period between 2012 and 2020:
-- Algo efficiency improved 44x, if we use the OpenAI efficiency baseline for AlexNet
-- Cost of compute improved by… less than 44x, let’s say, if we use a reasonable guess based on Moore’s law. So algo efficiency mattered more than cost per FLOP going down.
-- But, using EpochAI’s estimates for a 6 month doubling time, total compute per training run increased > 10,000x.
So just looking at cost of compute is somewhat misleading. Cost per FLOP went down, but the amount spent went up from just dollars on a training run to tens of thousands of dollars on a training run.
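As a quick sanity check on how these figures relate, here is a short sketch using only the numbers quoted above (the 44x figure and the 6-month doubling time, over 2012-2020); it is illustrative arithmetic, not a re-derivation of the underlying estimates.

```python
# Back-of-the-envelope comparison of the 2012-2020 figures quoted above.
import math

years = 2020 - 2012

# A ~44x algorithmic efficiency gain over the period implies a doubling time of:
algo_gain = 44
algo_doubling_months = 12 * years / math.log2(algo_gain)   # ~17.6 months

# A 6-month doubling time for training compute compounds to:
compute_gain = 2 ** (12 * years / 6)                        # 2^16 = 65,536x

print(f"algorithmic efficiency: ~{algo_gain}x total, doubling every ~{algo_doubling_months:.1f} months")
print(f"largest training runs: ~{compute_gain:,.0f}x total, i.e. well over 10,000x")
```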
It is ridiculous to interpret this as some general algo efficiency improvement—it’s a specific improvement in a specific measure (flops) which doesn’t even directly translate into equivalent wall-clock time performance, and is/was already encapsulated in sparsity techniques.
There has been extremely little improvement in general algorithm efficiency, compared to hardware improvement.
Not disagreeing. I’m still interested in a longer-form account of why the 44x estimate is an overestimate, if you’re interested in writing it (I think you mentioned looking into it at one point).
It’s like starting with an uncompressed image, and then compressing it further each year using different compressors (which aren’t even the best known, since better compressors were already known earlier on), and then measuring the data-size reduction over time and claiming it as a form of “general software efficiency improvement”. It’s nothing remotely comparable to Moore’s law progress (which actually improves a wide variety of software).
This is not right, at least in computer vision. They seem to be the same order of magnitude.
Physical compute has grown at 0.6 OOM/year and physical compute requirements have decreased at 0.1 to 1.0 OOM/year; see a summary here or an in-depth investigation here.
Another relevant quote:
Algorithmic progress explains roughly 45% of performance improvements in image classification, and most of this occurs through improving compute-efficiency.
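To see why those rates land in the same ballpark, here is a minimal conversion of the OOM/year figures above into per-year multipliers and doubling times (a sketch that uses only the rates quoted in this comment).

```python
# Convert the OOM/year rates quoted above into per-year multipliers and
# doubling times, to make the "same order of magnitude" comparison concrete.
import math

def describe(label, oom_per_year):
    multiplier = 10 ** oom_per_year                      # growth factor per year
    doubling_months = 12 * math.log10(2) / oom_per_year  # months per doubling
    print(f"{label}: ~{multiplier:.1f}x per year, doubling every ~{doubling_months:.0f} months")

describe("physical compute (0.6 OOM/yr)", 0.6)
describe("algorithmic efficiency, low end (0.1 OOM/yr)", 0.1)
describe("algorithmic efficiency, high end (1.0 OOM/yr)", 1.0)
```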
Cool, makes sense. Sounds like I remembered the upper bound for the algorithmic efficiency estimate. Thanks for correcting!
Algorithmic improvement has more FOOM potential. Hardware always has a lag.
That is, to a very basic approximation, correct.
Davidson’s takeoff model illustrates this point, where a “software singularity” happens for some parameter settings due to software not being restrained to the same degree by capital inputs.
I would point out, however, that our current understanding of how software progress happens is somewhat poor. Experimentation is definitely a big component of software progress, and its role is often understated on LW.
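As a heavily simplified illustration of that parameter dependence (a toy sketch, not Davidson’s actual model): let software efficiency S feed back into its own rate of improvement with a returns parameter r, holding compute fixed. The qualitative behavior flips at r = 1: above it you get a finite-time blowup, at or below it you do not.

```python
# Toy model of the feedback behind a "software singularity" (not Davidson's
# actual model): dS/dt = S**r with compute held fixed. For r > 1 the solution
# blows up in finite time; for r <= 1 it keeps growing but never diverges.

def time_to_blowup(r, s0=1.0, dt=0.001, horizon=50.0, cap=1e30):
    """Crudely integrate dS/dt = S**r; return the time S first exceeds cap, if ever."""
    s, t = s0, 0.0
    while t < horizon:
        s += dt * s ** r
        t += dt
        if s > cap:
            return t
    return None

for r in (0.7, 1.0, 1.5):
    t = time_to_blowup(r)
    if t is None:
        print(f"r = {r}: no blowup within the horizon")
    else:
        print(f"r = {r}: blows up after ~{t:.1f} time units")
```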
More research on this soon!