IMO deepseek-r1 follows this proposal pretty closely.
Proposal part 1: Shoggoth/Face Distinction
r1's CoT is the “shoggoth”. Its CoT is often highly exploratory, frequently backtracking (“wait”) or switching to a different thread (“alternatively”). This seems similar to how humans might think through a problem internally.
r1 is also the “face”, since it clearly demarcates the boundary between thinking (“<think> … </think>”) and action (everything else). This separation is enforced by a “format reward”.
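To make that mechanism concrete, here is a minimal sketch of what such a format reward might look like. DeepSeek hasn't published the exact rule in this form, so the regex and scoring below are my assumptions:

```python
import re

# A well-formed completion must contain a leading <think>...</think> block
# followed by a non-empty visible answer.
THINK_PATTERN = re.compile(r"^<think>(.*?)</think>(.+)$", re.DOTALL)

def format_reward(output: str) -> float:
    """Return 1.0 if the output cleanly separates thinking from the answer, else 0.0."""
    match = THINK_PATTERN.match(output.strip())
    if match is None:
        return 0.0
    answer = match.group(2).strip()
    return 1.0 if answer else 0.0

print(format_reward("<think>3*7=21... wait, recheck. Yes, 21.</think>The answer is 21."))  # 1.0
print(format_reward("The answer is 21."))  # 0.0, no <think> block
```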
Proposal part 2: Blind the Evaluation Process to the Internal Reasoning
This also seems to be satisfied, since r1 was mainly trained via RL against a rule-based reward that only evaluates the outcome. I.e. everything inside the <think> tags (the “shoggoth”) is not directly evaluated.
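As a rough illustration (not DeepSeek's actual grader), an outcome-only rule-based reward could simply discard the <think> block before checking the answer:

```python
import re

def outcome_reward(output: str, expected_answer: str) -> float:
    """Score only the visible answer; the CoT inside <think> tags is discarded."""
    visible = re.sub(r"<think>.*?</think>", "", output, flags=re.DOTALL)
    return 1.0 if expected_answer in visible else 0.0

print(outcome_reward("<think>I could hide anything here.</think>The answer is 42.", "42"))  # 1.0
```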
Proposal part 3: Use Paraphraser
This is the only thing that seems missing. But it seems like you could easily add this to the r1 recipe, i.e. just train r1 with its CoT paraphrased by some other LLM.
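A hedged sketch of what that addition could look like, assuming a hypothetical `paraphraser_model.generate(prompt) -> str` interface (not any real API):

```python
import re

def paraphrase_cot(output: str, paraphraser_model) -> str:
    """Replace the CoT inside <think> tags with a paraphrase from another model.

    `paraphraser_model.generate(prompt) -> str` is an assumed interface.
    """
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if match is None:
        return output
    original_cot = match.group(1)
    new_cot = paraphraser_model.generate(
        "Paraphrase the following reasoning, preserving its meaning:\n" + original_cot
    )
    # Swap in the paraphrased CoT; the visible answer is left untouched.
    return output.replace(original_cot, new_cot, 1)
```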
Since r1 is both the shoggoth and the face, Part 1 of the proposal (the shoggoth/face distinction) has not been implemented.
I agree part 2 seems to have been implemented, though I seem to remember something about trying to train it not to switch between languages in the CoT, and how that degraded performance?
I agree it would be pretty easy to fine-tune R1 to implement all the stuff I wanted. That's why I made these proposals back in 2023: I was looking ahead to the sorts of systems that would exist in 2024, and thinking they could probably be made to have some nice faithfulness properties fairly easily.