I agree because:
Some papers are already using implicit process-based supervision. That's where the reward model estimates how "good" a step is by how likely it is to lead to a good final outcome. This bypasses any explicitly labeled process; instead, step quality is negotiated between the policy and the reward model. It's not clear to me whether this scales as well as explicit process supervision, but it's certainly easier to find labels.
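To make the mechanism concrete, here's a minimal sketch of one common way to get implicit step rewards, via Monte Carlo rollouts (the `policy` and `verifier` objects and their methods are hypothetical stand-ins, not any particular paper's API):

```python
def implicit_step_reward(policy, verifier, prefix_steps, n_rollouts=8):
    """Score a partial solution without human step labels: roll the
    policy out to completion several times and use the fraction of
    correct final answers as the reward for the last step taken."""
    successes = 0
    for _ in range(n_rollouts):
        completion = policy.complete(prefix_steps)  # sample the rest of the solution
        if verifier.is_correct(completion):         # outcome-only check (e.g. final-answer match)
            successes += 1
    return successes / n_rollouts
```

No step is ever labeled directly; the only supervision signal is whether the finished solution checks out.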
In rStar-Math they did implicit process supervision. That said, I don't think it's a true o1/o3 replication, since they started with a 236B model and produced a 7B model; in other words, indirect distillation.
Outcome-Refining Process Supervision for Code Generation did this too.
There was also the recent COCONUT paper exploring non-legible latent CoT, where the model reasons in continuous hidden states rather than in tokens. It shows extreme token efficiency, and while it wasn't better overall, it has lots of room for improvement. If frontier models end up using latent thoughts, they will be even less human-legible than the current inconsistently-candid CoT.
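Roughly, the core move is to feed the network's last hidden state back in as the next input embedding, so the "thought" is never projected down to a token. A minimal sketch, assuming a generic decoder-style model whose outputs expose `last_hidden_state` (names and interface are illustrative, not COCONUT's actual code):

```python
import torch

def latent_cot(model, input_embeds, n_latent_thoughts=4):
    """Continuous chain of thought: instead of decoding a token at each
    reasoning step, append the final-layer hidden state of the last
    position directly to the input embeddings. The intermediate
    reasoning stays in latent space and never appears as text."""
    for _ in range(n_latent_thoughts):
        hidden = model(inputs_embeds=input_embeds).last_hidden_state
        thought = hidden[:, -1:, :]  # last position's hidden state, kept continuous
        input_embeds = torch.cat([input_embeds, thought], dim=1)
    return input_embeds  # decode the final answer from here as usual
```

Each latent "thought" stands in for what would otherwise be many sampled tokens, which is where the token efficiency comes from, and also why there's nothing legible left to inspect.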
I also think this whole episode shows how hard it is to maintain an algorithmic advantage. How long did DeepSeek R1 come out after o3? The lack of a durable algorithmic advantage predicts multiple winners in the AGI race.