News on the next OAI GPT release:

Nagasaki, CEO of OpenAI Japan, said, “The AI model called ‘GPT Next’ that will be released in the future will evolve nearly 100 times based on past performance. Unlike traditional software, AI technology grows exponentially.”
The slide clearly states “GPT Next” for 2024. This 100-fold increase probably does not refer to the scaling of computing resources, but rather to effective compute (+2 OOMs), including improvements to the architecture and learning efficiency. GPT-4 NEXT, which will be released this year, is expected to be trained using a miniature version of Strawberry, with roughly the same computational resources as GPT-4 but an effective compute 100 times greater. Orion, which has been in the spotlight recently, was trained for several months on the equivalent of 100K H100s compared to GPT-4 (EDIT: the original tweet said 10K H100s, but that was a mistake), adding 10 times the computational resource scale, making it +3 OOMs, and is expected to be released sometime next year.
Note: Another OAI employee seemingly confirms this (I’ve followed them for a while, and they are working on inference).
Orion, which has been in the spotlight recently, was trained for several months on the equivalent of 10k H100 compared to GPT-4, adding 10 times the computational resource scale
Taken literally, this implies successful use of FP8. In BF16 an H100 gives 1e15 FLOP/s (in dense tensor compute). With 40% utilization over 10 months, 10K H100s give 1e26 FLOPs, which is only 5 times higher than the rumored 2e25 FLOPs of the original GPT-4. To get to 10 times higher requires some further 2x improvement, and the evident way to get that is by transitioning from BF16 to FP8. I think use of FP8 for training hasn’t been confirmed to be feasible at GPT-4 level scale (Llama-3-405B uses BF16), but if it does work, that’s a 2x compute increase available to other models as well.
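A minimal back-of-the-envelope sketch of that arithmetic (all inputs are just the assumptions stated above: 1e15 dense BF16 FLOP/s per H100, 40% utilization, ~10 months, and the rumored 2e25 FLOPs for the original GPT-4):

```python
# Rough check: 10K H100s over ~10 months at 40% utilization.
H100_BF16_FLOPS = 1e15                 # assumed dense BF16 throughput per H100
SECONDS = 10 * 30 * 24 * 3600          # ~10 months
GPT4_FLOPS = 2e25                      # rumored compute of the original GPT-4

bf16_total = 10_000 * H100_BF16_FLOPS * 0.40 * SECONDS
fp8_total = 2 * bf16_total             # FP8 roughly doubles dense throughput

print(f"BF16: {bf16_total:.1e} FLOPs, {bf16_total / GPT4_FLOPS:.0f}x GPT-4")  # ~1.0e+26, 5x
print(f"FP8:  {fp8_total:.1e} FLOPs, {fp8_total / GPT4_FLOPS:.0f}x GPT-4")    # ~2.1e+26, 10x
```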
This text about Orion and 10K H100s only appears in the bioshok3 tweet itself, not in the quoted news article, so it’s unclear where the details come from. The “10 times the computational resource scale, making it +3 OOMs” hype within the same sentence also undermines confidence that the specific numbers (10 times, 10K H100s, several months) are accurate.
Another implication is that Orion is not the 100K H100s training run (that’s probably currently ongoing). Plausibly it’s an experiment with training on a significant amount of synthetic data. This suggests that the first 100K H100s training run won’t be experimenting with too much synthetic training data yet, at least in pre-training. The end of 2025 point for significant advancement in quality might then be referring to the possibility that Orion succeeds and its recipe is used in another 100K H100s scale run, which might be the first hypothetical model they intend to call “GPT-5”. The first 100K H100s run by itself (released in ~early 2025) would then be called “GPT-4.5o” or something (especially if Orion does succeed, so that “GPT-5” remains on track).
Bioshok3 said in a later tweet that they were in any case mistaken about it being 10K H100s, and that it was actually 100K H100s: https://x.com/bioshok3/status/1831016098462081256

Surprisingly, the wording itself offers an additional clue for this: 2e26 BF16 FLOPs take about 2.5 months on 100K H100s at 30% utilization, while the duration in the original tweet is given as “数ヶ月” (“several months”), which GPT-4o explains as follows:
The Japanese term “数ヶ月” (すうかげつ, sūka getsu) translates to “several months” in English. It is an approximate term, generally referring to a period of 2 to 3 months but can sometimes extend to 4 or 5 months, depending on context. Essentially, it indicates a span of a few months without specifying an exact number.
So the interpretation that fits best is specifically 2-3 months (Claude says 2-4 months, Grok 3-4 months), close to what the calculation for 100K H100s predicts. And this is quite unlike the requisite 10 months with 10K H100s in FP8.
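For reference, the duration behind that 2.5 months figure, under the same assumed 1e15 dense BF16 FLOP/s per H100 and the 30% utilization used above:

```python
# How long 2e26 BF16 FLOPs takes on 100K H100s at 30% utilization.
target_flops = 2e26
cluster_flops = 100_000 * 1e15 * 0.30   # effective FLOP/s of the whole cluster

days = target_flops / cluster_flops / (24 * 3600)
print(f"{days:.0f} days, ~{days / 30:.1f} months")   # 77 days, ~2.6 months
```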
My guess is that the Orion / 10K H100s detail is just false / hallucinated.
“Orion is 10x compute” seems plausible; “Orion was trained on only 10K H100s” does not seem plausible if it is actually supposed to be 10x raw compute. Around 50K H100s does seem plausible, and would correspond to about 10x compute assuming a training duration similar to GPT-4’s.
Within this hypothetical, Orion didn’t necessarily merit the use of the largest training cluster, while time on 10K H100s is something mere money can buy without impacting other plans. GPT-4o is itself plausibly at the 1e26 FLOPs level already, since H100s were around for more than a year before it came out (1e26 FLOPs is 5 months on 20K H100s). It might be significantly overtrained, or its early fusion multimodal nature might balloon the cost of effective intelligence. Gemini 1.0 Ultra, presumably also an early fusion model with rumored 1e26 FLOPs, similarly wasn’t much better than the Mar 2023 GPT-4. Gemini 1.0 is plausibly dense, though, given how the Gemini 1.5 report stressed that 1.5 is MoE, so that might be a factor in why 1e26 FLOPs didn’t buy it much of an advantage.
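As a rough cross-check of the two figures above (the ~10x estimate for a ~50K H100 run, and the “1e26 FLOPs is 5 months on 20K H100s” parenthetical), again assuming 1e15 dense BF16 FLOP/s per H100 and 40% utilization, and guessing a GPT-4-like run length of roughly 3 months:

```python
def run_flops(gpus: int, days: float, util: float = 0.40) -> float:
    """Total training FLOPs assuming 1e15 dense BF16 FLOP/s per H100."""
    return gpus * 1e15 * util * days * 24 * 3600

orion_50k = run_flops(50_000, 90)     # hypothetical ~50K H100s for ~3 months
gpt4o_20k = run_flops(20_000, 150)    # 20K H100s for ~5 months

print(f"50K H100s, ~3 months: {orion_50k:.1e} FLOPs, {orion_50k / 2e25:.1f}x rumored GPT-4")
print(f"20K H100s, ~5 months: {gpt4o_20k:.1e} FLOPs")
# -> roughly 1.6e+26 FLOPs (~8x) and 1.0e+26 FLOPs respectively
```

The 50K-H100 case lands around 8x rather than exactly 10x; FP8 or a somewhat longer run would close the gap.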
So if GPT-4o is not far behind in terms of FLOPs, a 2e26 FLOPs Orion wouldn’t be a significant improvement unless the synthetic data aspect works very well, and so there would be no particular reason to rush it. On the other hand, GPT-4o looks like something that needed to be done as fast as possible, so the largest training cluster went to it and not to Orion. The scaling timelines are dictated by the construction of the largest training clusters, not by decisions about how smaller training clusters are used.
This tweet also claims 10K H100s while citing the same article, which doesn’t mention this figure.
Are you sure he is an OpenAI employee?