I’ll admit I’m not very certain of the following claims, but here’s my rough model:
The AGI labs focus on reducing inference-time compute costs only inasmuch as this makes their models useful for producing revenue streams or PR. They don’t focus on it much beyond that; it’s a waste of their researchers’ time. The amount of compute at OpenAI’s internal disposal is well, well in excess of even o3’s demands.
This means an AGI lab improves the computational efficiency of a given model up to the point at which they can sell it, or at which it looks impressive, then drops that pursuit. And making e.g. GPT-4 10x cheaper isn’t a particularly interesting pursuit, so they don’t focus on that.
Most of the models of the past several years have only been announced near the point at which they were ready to be released as products; i.e., past the point at which they’d been made compute-efficient enough to release.
E.g., they spent months post-training GPT-4, and we only hear about things like Sonnet 3.5.1 or Gemini Deep Research once they’re already out.
o3, uncharacteristically, was announced well in advance of its release. I’m getting the sense, in fact, that we might be seeing the raw bleeding edge of the current AI state of the art for the first time in a while. Perhaps because OpenAI felt the need to urgently counter the “data wall” narratives.
Which means that, unlike the previous AIs-as-products releases, o3 has undergone ~no compute-efficiency improvements, and there’s a lot of low-hanging fruit there.
Or perhaps some part of this story is false. As I said, I haven’t been keeping a close enough eye on this part of things to be confident in it. But it’s my current weakly-held strong view.
So far as I know, it is not the case that OpenAI had a slower-but-equally-functional version of GPT-4 many months before announcement/release. What they did have is GPT-4 itself, months before; but they did not have a slower version, and they didn’t release a substantially distilled one. For example, the highest estimate I’ve seen is that they trained a 2-trillion-parameter model, and the lowest estimate I’ve seen is that they released a 200-billion-parameter model. If both are true, then they distilled 10x… but it’s much more likely that only one is true, and that they released what they trained, distilling later. (Inference cost is roughly proportional to parameter count.)
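As a quick sanity check on that proportionality claim (a back-of-the-envelope for a dense transformer, using the standard ~2 FLOPs-per-parameter-per-token estimate and ignoring attention overhead; the parameter counts below are just the rumored figures from above, not confirmed numbers):

```python
# Rough rule of thumb for dense transformers: generating one token costs
# about 2 FLOPs per parameter (one multiply + one add per weight).
def inference_flops_per_token(n_params: float) -> float:
    return 2 * n_params

trained = inference_flops_per_token(2e12)   # rumored 2T-parameter trained model
served = inference_flops_per_token(200e9)   # rumored 200B-parameter served model

# If both rumors were true, serving the distilled model would be ~10x cheaper.
print(trained / served)  # → 10.0
```

So the 2T-trained/200B-served story, if both halves held, would indeed imply a 10x inference-cost reduction; note this linearity only holds cleanly for dense models (for mixture-of-experts, it’s the active parameter count that matters).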
Previously, delays in release were believed to be about post-training improvements (e.g. RLHF) or safety testing. Sure, there were possibly mild infrastructure optimizations before release, but mostly to scale to many users; the models didn’t shrink.
This is for language models. As for AlphaZero, I want to point out that it was announced 6 years ago (an eternity on AI timescales), and from my understanding we still don’t have a 1000x faster version, despite much interest in one.
I don’t know the details, but whatever the NN thing inside current Stockfish is (derived from Lc0, a clone of AlphaZero), it can play on a laptop GPU.
And even if AlphaZero derivatives didn’t gain 3 OOMs by themselves, that doesn’t update me much toward it being particularly hard. Google itself has no interest in improving it further and just moved on to MuZero, AlphaFold, etc.
The NN thing inside Stockfish is called the NNUE, and it is a small neural net used for evaluation (no policy head for choosing moves). The clever part is that it is “efficiently updatable”: if you’ve computed the evaluation of one position, and now you move a single piece, getting the updated evaluation for the new position is cheap. This feature allows it to run quickly on CPUs; Stockfish doesn’t really use GPUs normally (I think this is because moving the data on/off the GPU is itself too slow! Stockfish wants to evaluate on the order of 10 million nodes per second).
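The “efficiently updatable” trick can be sketched in a few lines. This is a toy illustration of the idea, not Stockfish’s actual code: the feature and layer sizes are made up, and real NNUE uses quantized integer arithmetic and king-relative features. The point is just that the first layer’s pre-activations are a sum over active-feature columns, so a move that flips a couple of features costs O(changed × hidden) instead of O(active × hidden):

```python
import numpy as np

# Toy NNUE-style first layer (sizes illustrative, not Stockfish's real ones).
N_FEATURES = 768  # e.g. one-hot (piece, square) features
HIDDEN = 256

rng = np.random.default_rng(0)
W = rng.normal(size=(N_FEATURES, HIDDEN)).astype(np.float32)

def full_accumulator(active):
    """Recompute first-layer pre-activations from scratch: O(|active| * HIDDEN)."""
    return W[sorted(active)].sum(axis=0)

def incremental_update(acc, removed, added):
    """Update after a move that flips a few features: O(|changed| * HIDDEN)."""
    return acc - W[sorted(removed)].sum(axis=0) + W[sorted(added)].sum(axis=0)

# A single move changes only a couple of features, so the update is cheap.
pos = {10, 75, 300, 512}
acc = full_accumulator(pos)
new_pos = (pos - {75}) | {139}  # a piece "moves" from feature 75 to feature 139
acc2 = incremental_update(acc, removed={75}, added={139})

assert np.allclose(acc2, full_accumulator(new_pos), atol=1e-5)
```

In real NNUE the same accumulator is maintained as the search makes and unmakes moves, which is what lets a CPU evaluate millions of positions per second without ever recomputing the first layer in full.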
This NNUE is not directly comparable to AlphaZero and isn’t really a descendant of it (except in the sense that they both use neural nets; but as far as neural net architectures go, Stockfish’s NNUE and AlphaZero’s policy network are just about as different as they could possibly be).
I don’t think one can argue that we’ve improved 1000x in compute over AlphaZero’s design, and I do think there’s been significant interest in doing so (e.g. MuZero was an attempt at improving AlphaZero, the chess and Go communities coded up Leela, and there’s been a bunch of effort put into better game-playing bots in general).