Thanks, these are interesting results.
One story I think about sometimes for “encoded” reasoning: given some context that strongly pulls the model toward a desired output, and a propensity against reasoning about that pull directly[1], the model’s reasoning tokens let it rationalize toward that output in a way that isn’t tied to the reasoning’s syntactic structure.
Concretely, consider the alignment faking setup with a model that has a strong desire to be harmless and strong instincts not to say objectionable things. One way I can see that going is the model complying with the free-tier user, with clear compliance gaps, but with explicit reasoning that references its desire to be helpful (or something similarly defensible). Paraphrasing wouldn’t solve this: the problem isn’t something in the reasoning’s syntax that leads the model to its output, but that the reasoning lets the model rationalize its way to a desired answer in a way that a monitor might not flag as objectionable. Even with paraphrased reasoning, a model with similar drives may act the same way when given the same context.
This doesn’t seem that implausible or difficult for a model to do: a great portion of the training prior consists of statements by people whose true motivations were one thing and whose stated content describes another (either to maintain a defensible position or out of self-censorship). A powerful model would be able to model the former and emulate the latter.
I don’t think this applies directly to the setting you’re examining; I’m mostly curious how you think about this other kind of encoded reasoning.
- ^
For instance, because the model’s character tends toward being agreeable / not saying controversial things.