The fact that RL now seems to work well on LLMs without special tricks, as reported by many replications of R1, suggests to me that AGI is indeed not far off.
Still, at least until base model effective training compute is scaled another 1,000x (which doesn't happen until 2028-2029), this kind of RL training probably won't generalize far enough without neural (LLM-based) rewards, and for now those don't let RL scale as far as explicitly coded verifiers do.
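For the 2028-2029 figure, the implied arithmetic is just compound growth; here's a minimal sketch assuming frontier base model effective training compute grows roughly 4-5x per year (the growth rate is my assumption, not something stated above):

```python
import math

# How long it takes to scale effective training compute another 1,000x,
# assuming (my assumption) frontier base-model compute grows ~4-5x per year.
for annual_growth in (4, 5):
    years = math.log(1000) / math.log(annual_growth)
    print(f"{annual_growth}x/year -> 1,000x in ~{years:.1f} years")
# ~5.0 years at 4x/year, ~4.3 years at 5x/year:
# starting from 2024-level base models, that lands around 2028-2029.
```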
Not to convergence: the graphs in the paper keep going up. Carrying the analogy over, that might explain some of the change from o1 to o3 (the graphs in the o1 post also keep going up), though new graders coded for additional verifiable problems are no doubt a large part of it as well.
It seems like o1-mini is its own thing; it might even start from a base model unrelated to GPT-4o-mini (perhaps with its own specialized pretraining data mix). So a clue about o3-mini data doesn't obviously transfer to o3.
The numbering in the GPT-N series advances by roughly 100x in raw compute per generation. If the original GPT-4 is 2e25 FLOPs, then a GPT-5 would need 2e27 FLOPs, and a 100K-H100 training system (like the Microsoft/OpenAI system at the site near the Goodyear airport) can only get you 3e26 FLOPs or so (in BF16, over 3 months). The initial Stargate training system at the Abilene site, once it gets 300K B200s, will be about 7x stronger than that, so it will be able to reach 2e27 FLOPs. Thus I expect GPT-5 in 2026 if OpenAI keeps following the naming convention, while the new 100K-H100 model this year will be GPT-4.5o or something like that.
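As a rough sketch of that arithmetic (the per-chip peak throughputs and the ~40% utilization figure are my assumptions, not from the posts I'm referencing):

```python
# Back-of-the-envelope FLOPs for the two training systems.
# Assumptions (mine): H100 dense BF16 peak ~1e15 FLOP/s, B200 ~2.25e15 FLOP/s,
# ~40% utilization, a ~3-month (90-day) training run.

SECONDS_IN_3_MONTHS = 90 * 24 * 3600   # ~7.8e6 s
UTILIZATION = 0.4                       # assumed compute utilization

h100_peak = 1.0e15                      # dense BF16 FLOP/s per H100 (approx.)
b200_peak = 2.25e15                     # dense BF16 FLOP/s per B200 (approx.)

h100_cluster = 100_000 * h100_peak * UTILIZATION * SECONDS_IN_3_MONTHS
b200_cluster = 300_000 * b200_peak * UTILIZATION * SECONDS_IN_3_MONTHS

print(f"100K H100s, 3 months: {h100_cluster:.1e} FLOPs")  # ~3e26
print(f"300K B200s, 3 months: {b200_cluster:.1e} FLOPs")  # ~2e27
print(f"ratio: {b200_cluster / h100_cluster:.1f}x")        # ~6.8x, i.e. roughly 7x
```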