I do think that some of the deep learning revolution turned out to be kind of compute bottlenecked, but I don’t believe that’s very true anymore.
I had kind of the exact opposite impression of compute bottlenecks (that deep learning was not meaningfully compute bottlenecked until very recently). OpenAI apparently has a bunch of products and probably also experiments that are literally just waiting for H100s to arrive. Probably this is mainly due to the massive demand for inference, but still, this seems like a kind of actual hardware bottleneck that is pretty new for the field of DL. It has a parallel to Bitcoin mining, where the ability to get the latest-gen ASICs first was (still is?) a big factor in miner profitability.
Huh, maybe. My current guess is that things aren’t really “compute bottlenecked”; it’s just that AI is now profitable enough that we really want better compute. But even if compute didn’t get cheaper, we would still see performance increase a lot as we find ways to improve compute-efficiency, the same way we’ve been improving it over the past 5-10 years. And I’d guess that, for any given period of time, the algorithmic progress is a bigger deal for increasing performance than the degree to which compute got cheaper in the same period.
I’d say usually bottlenecks aren’t absolute, but instead quantifiable and flexible based on costs, time, etc.?
One could say that we’ve reached the threshold where we’re bottlenecked on inference-compute, whereas previously talk of compute bottlenecks was about training-compute.
This seems to matter for some FOOM scenarios since e.g. it limits the FOOM that can be achieved by self-duplicating.
But the fact that AI companies are trying their hardest to scale up compute, and are also actively researching more compute-efficient algorithms, means IMO that the inference-compute bottleneck will be short-lived.
In what sense are they “not trying their hardest”?
I think you inserted an extra “not”.
Oh gosh, how did I hallucinate that?
Maybe you’re an LLM.
“For any given period of time, the algorithmic progress is a bigger deal for increasing performance than the degree to which compute got cheaper in the same period.”
This is true, but as a picture of the past, it undersells compute by focusing on the cost of compute rather than compute itself.
I.e., in the period between 2012 and 2020:
-- Algo efficiency improved 44x, if we use the OpenAI efficiency baseline for AlexNet.
-- Cost of compute improved by… less than 44x, let’s say, if we use a reasonable guess based on Moore’s law. So algo efficiency was more important than the cost per FLOP going down.
-- But, using EpochAI’s estimates for a 6-month doubling time, total compute per training run increased > 10,000x.
So just looking at cost of compute is somewhat misleading. Cost per FLOP went down, but the amount spent went up from just dollars on a training run to tens of thousands of dollars on a training run.
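To make the arithmetic concrete, here is a back-of-the-envelope sketch in Python that just combines the figures quoted above (the 44x efficiency estimate, the “less than 44x” cost-per-FLOP guess, and the 6-month doubling time); treat the outputs as rough orders of magnitude, not independent estimates:

```python
# Back-of-the-envelope arithmetic for the 2012-2020 comparison above.
# All inputs are the figures quoted in this thread, not independent estimates.

years = 2020 - 2012              # the window under discussion
doubling_time_years = 0.5        # the 6-month doubling time for training compute
algo_efficiency_gain = 44        # the AlexNet-baseline efficiency figure
cost_per_flop_gain = 44          # generous upper bound on "less than 44x"

# A 6-month doubling time over 8 years gives 2^16 growth in training compute,
# comfortably above the "> 10,000x" figure used below.
compute_growth = 2 ** (years / doubling_time_years)
print(f"training compute growth: ~{compute_growth:,.0f}x")      # ~65,536x

# If compute per run grew > 10,000x while cost per FLOP fell by at most ~44x,
# dollars spent per training run must have grown by at least this factor:
min_spend_growth = 10_000 / cost_per_flop_gain
print(f"implied spend growth: > {min_spend_growth:,.0f}x")       # ~227x

# Multiplying the algorithmic gain by the raw compute growth gives a crude
# "effective compute" increase; most of it comes from buying more compute,
# not from FLOPs getting cheaper.
effective_growth = algo_efficiency_gain * 10_000
print(f"effective compute growth: > {effective_growth:,.0f}x")   # 440,000x
```

On these numbers, the growth in how much compute was simply bought dwarfs both the 44x algorithmic gain and the cost-per-FLOP improvement, which is the sense in which focusing only on cost of compute undersells compute.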
It is ridiculous to interpret this as some general algo efficiency improvement: it’s a specific improvement in a specific measure (FLOPs) which doesn’t even directly translate into equivalent wall-clock performance, and is/was already encapsulated in sparsity techniques.
There has been extremely little improvement in general algorithm efficiency, compared to hardware improvement.
Not disagreeing. I’m still interested in a longer-form view of why the 44x estimate is an overestimate, if you’re interested in writing it (I think you mentioned looking into it at one point).
It’s like starting with an uncompressed image, and then compressing it further each year using different compressors (which aren’t even the best known, as better compressors were already known from the beginning), and then measuring the data size reduction over time and claiming it as a form of “general software efficiency improvement”. It’s nothing remotely comparable to Moore’s law progress (which more generally actually improves a wide variety of software).
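For what it’s worth, a tiny sketch of what that measurement looks like in practice, using Python’s standard-library compressors on a made-up artifact (everything here is hypothetical and only illustrates the analogy):

```python
import bz2
import lzma
import zlib

# A stand-in for the "uncompressed image": one fixed, somewhat-redundant artifact.
data = bytes(range(256)) * 4000  # ~1 MB repeating gradient pattern

# "Different compressors in different years", all applied to the same artifact.
compressors = [
    ("year 1: zlib, level 1", lambda d: zlib.compress(d, 1)),
    ("year 2: zlib, level 9", lambda d: zlib.compress(d, 9)),
    ("year 3: bz2", lambda d: bz2.compress(d)),
    ("year 4: lzma", lambda d: lzma.compress(d)),
]

baseline = len(data)
for label, compress in compressors:
    ratio = baseline / len(compress(data))
    print(f"{label}: {ratio:,.0f}x smaller than the raw artifact")

# Whatever trend these ratios show, it is a property of this one artifact and
# this one size metric, not a general improvement in software efficiency, and
# stronger compressors than the early ones already existed at the start.
```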