However, I expect RL on CoT to amount to “process-based supervision,” which seems inherently safer than “outcome-based supervision.”
I think the opposite is true; the RL on CoT that is already being done and will increasingly be done is going to be in significant part outcome-based (and a mixture of outcome-based and process-based feedback is actually less safe than just outcome-based IMO, because it makes the CoT less faithful).
I think all of the following:
- process-based feedback pushes against faithfulness because it incentivises having a certain kind of CoT independently of the outcome
- outcome-based feedback pushes towards faithfulness because it incentivises making use of earlier tokens to get the right answer
- outcome-based feedback pushes against legibility because it incentivises the model to discover new strategies that we might not know it's using
- combining process-based feedback with outcome-based feedback (see the toy sketch after this list):
  - pushes extra hard against legibility because it incentivises obfuscating said strategies[1]
  - unclear sign wrt faithfulness

[1] Or at least has the potential to, depending on the details.
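To make that last case concrete, here is a toy sketch (my own illustration, not any lab's actual training setup) of what a mixed signal looks like: a terminal outcome score plus a per-step score from a process judge. The `outcome_score` and `process_score` callables are hypothetical stand-ins for a verifier and a judge model. The process term pays the model for how the CoT looks rather than for what it does, which is the channel the obfuscation pressure comes through.

```python
from typing import Callable, List

def mixed_reward(
    cot_steps: List[str],
    final_answer: str,
    outcome_score: Callable[[str], float],  # hypothetical verifier: e.g. 1.0 iff the answer checks out
    process_score: Callable[[str], float],  # hypothetical judge: how "good" a reasoning step looks
    process_weight: float = 0.5,
) -> float:
    """Toy mixture of outcome-based and process-based feedback.

    The outcome term rewards actually reaching the right answer; the process
    term rewards steps that look good to the judge regardless of whether they
    drove the answer, which is the incentive toward unfaithful or obfuscated
    CoT discussed above.
    """
    outcome = outcome_score(final_answer)
    process = sum(process_score(s) for s in cot_steps) / max(len(cot_steps), 1)
    return outcome + process_weight * process
```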
I agree because:
Some papers are already using implicit process-based supervision. That's where the reward model guesses how "good" a step is by how likely it is to lead to a good outcome. So they bypass any explicitly labeled process; instead, it's negotiated between the policy and the reward model. It's not clear to me whether this scales as well as explicit process supervision, but it's certainly easier to find labels.
In rStar-Math they did implicit process supervision. That said, I don't think this is a true o1/o3 replication, since they started with a 236B model and produced a 7B model; in other words, indirect distillation.
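Here is a minimal sketch of that "implicit" labeling idea, under my own simplifying assumptions (the `sample_continuation` and `check_answer` callables are hypothetical stand-ins for a policy rollout and an outcome verifier, not functions from any of the cited papers): each step's process score is just the fraction of Monte Carlo continuations from that prefix that reach a correct final answer, so no human ever labels intermediate steps.

```python
from typing import Callable, List

def implicit_step_values(
    question: str,
    steps: List[str],                           # one sampled chain of reasoning steps
    sample_continuation: Callable[[str], str],  # hypothetical: roll the policy out from a prefix
    check_answer: Callable[[str], bool],        # hypothetical: outcome check, e.g. exact-match grading
    n_rollouts: int = 8,
) -> List[float]:
    """Estimate per-step "process" scores purely from outcomes.

    The score of step i is the empirical probability that continuations of
    the prefix steps[:i+1] end in a correct answer; these scores can then be
    used to train a process reward model without human step labels.
    """
    values = []
    for i in range(len(steps)):
        prefix = question + "\n" + "\n".join(steps[: i + 1])
        wins = sum(check_answer(sample_continuation(prefix)) for _ in range(n_rollouts))
        values.append(wins / n_rollouts)
    return values
```

Steps after which the estimated value collapses get treated as bad steps, even though nothing about the step itself was ever judged directly; the "process" signal is entirely downstream of the outcome signal, which is why it is called implicit.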
The "Outcome-Refining Process Supervision for Code Generation" paper did it too.
There was also the recent COCONUT paper exploring non-legible latent CoT. It shows extreme token efficiency; while it wasn't better overall, it has lots of room for improvement. If frontier models end up using latent thoughts, they will be even less human-legible than today's inconsistently candid CoT.
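For intuition, here is a minimal sketch of the "continuous thought" idea only; it is not COCONUT's actual training recipe, and an off-the-shelf model like the GPT-2 stand-in below has not been trained to make use of such states. Instead of decoding each reasoning step into tokens, the last hidden state is fed straight back in as the next input embedding, so the intermediate "thoughts" never exist as human-readable text.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; COCONUT fine-tunes the model to use these latent states
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Question: ..."
ids = tok(prompt, return_tensors="pt").input_ids
embeds = model.get_input_embeddings()(ids)            # (1, seq_len, hidden)

n_latent_thoughts = 4
for _ in range(n_latent_thoughts):
    out = model(inputs_embeds=embeds, output_hidden_states=True)
    last_hidden = out.hidden_states[-1][:, -1:, :]     # final position's last-layer state
    # Feed the hidden state back as the next "input token": a continuous
    # thought that is never decoded into human-readable tokens.
    embeds = torch.cat([embeds, last_hidden], dim=1)

# Only after the latent steps are actual (visible) answer tokens decoded.
next_token_id = model(inputs_embeds=embeds).logits[:, -1, :].argmax(dim=-1)
print(tok.decode(next_token_id))
```

The legibility worry is visible right in the loop: everything a CoT monitor could have read is replaced by vectors.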
I also think this whole episode shows how hard it is to maintain an algorithmic advantage. DeepSeek R1 came how long after o3? The lack of a durable algorithmic advantage predicts multiple winners in the AGI race.
Moreover, in this paradigm, forms of hidden reasoning seem likely to emerge: in multi-step reasoning, for example, the model might find it efficient to compress backtracking or common reasoning cues into cryptic tokens (e.g., “Hmmm”) as a kind of shorthand to encode arbitrarily dense or unclear information. This is especially true under financial pressures to compress/shorten the Chains-of-Thought, thus allowing models to perform potentially long serial reasoning outside of human/AI oversight.
I agree with this, and would like to add that scaling along the inference-time axis seems likely to rapidly push performance on certain closed-domain reasoning tasks far beyond human capabilities (likely already this year!), which will serve as a very convincing show of safety to many people and will lead to wide adoption of such models for intellectual task automation. But without the various forms of experiential and common-sense reasoning humans have, there's no telling where and how such a "superhuman" model may catastrophically mess up, simply because it doesn't understand a lot of things any human being takes for granted. Given the current state of AI development, this strikes me as literally the shortest path to a paperclip maximizer. Well, maybe not that catastrophic, but hey, you never know.
In terms of how immediately this accelerates certain adoption-related risks, I don't think it bodes particularly well. I would prefer more evenly spread cognitive capabilities.