This updated me towards CoT steganography being a real problem. It looks like OpenAI’s model learned to save space by deleting common words like “the” and “is.” For example:
So analyze functions used in analyze and verify. But tests only call verify and assert ok. So we need implement analyze polynomial completely? Many details. Hard.
The CoT is mostly coherent but a bit difficult to read. This seems like a precursor to steganography.
One possible mitigation to this would be to use another model that paraphrases each sentence in the CoT as it’s written (perhaps using canonical paraphrasing), and replaces sentences that are ambiguous, ungrammatical, or nonsensical with a dummy string like “REMOVED”.
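A minimal sketch of what that pipeline could look like, assuming the paraphraser is exposed as a `paraphrase` callable wrapping a separate model that returns None whenever it can't produce a confident canonical restatement (all names here are hypothetical):

```python
from typing import Callable, Optional

REMOVED = "REMOVED"


def sanitize_cot(cot_sentences: list[str],
                 paraphrase: Callable[[str], Optional[str]]) -> list[str]:
    """Rewrite each CoT sentence through a separate paraphrasing model as it
    is produced, replacing anything the paraphraser can't confidently restate
    (ambiguous, ungrammatical, or nonsensical sentences) with a dummy string."""
    sanitized = []
    for sentence in cot_sentences:
        restated = paraphrase(sentence)
        sanitized.append(restated if restated is not None else REMOVED)
    return sanitized


if __name__ == "__main__":
    # Trivial stand-in paraphraser that rejects very short fragments; a real
    # version would call an LLM prompted to emit a canonical paraphrase, or
    # to refuse when the sentence is ambiguous or nonsensical.
    demo = ["So analyze functions used in analyze and verify.", "Many details. Hard."]
    print(sanitize_cot(demo, lambda s: s if len(s.split()) > 3 else None))
```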
Edit: I now think the section below is probably overstated—see my follow-up reply below.
In another experiment, the authors try adding process supervision on the CoT, and they show that the model learns to reward-hack without triggering the monitor. Looking at the resulting CoT, I think it’s likely that the model learns full-fledged steganography to convey complex information about its reasoning.
For one thing, the CoT with process supervision is much longer than the baseline, which could be because steganography takes much more text than stating the reasoning plainly. The CoT is also quite difficult to parse, which makes me think it’s obfuscated somehow. Finally, the usage of the word “depurate” here was very suspicious to me:
Depurate differently: Check test expectation: new build, succession randomization, verify rows, columns, assert controller.state.succession == 0 implies that random_new_succession shouldn’t increment. However, in implementation, randomize was set to increment succession.
I asked ChatGPT what “depurate” means, and it has nothing to do with programming. The meaning is “to sanitize,” which is completely unrelated to the context.
If the model is using some kind of steganographic scheme, it would be really interesting to figure out how it works!
Somebody pointed out to me that the reward hacking behavior is quite straightforward. They think that the LLM should be able to accomplish it without any reasoning, so the CoT is likely just random thoughts that don’t really matter.
I just reread the optimized CoT from the blog post more carefully. It makes more sense to me now—it was difficult to parse largely due to the lack of context about the problem the LLM is solving, but the CoT isn’t incoherent or anything.
The first part of the optimized CoT is (seemingly) a genuine attempt to diagnose a bug, which is why it’s longer than the baseline. This reasoning is likely for the benefit of the monitor—the model might “know” that it’s just going to delete the unit test, but it’s best to go through the motions of trying to solve the issue.
The model apparently decides to skip the test only at the very end of its CoT, slipping its decision in very briefly with the phrase “test adapted to behavior”.
A clearer way of saying this would be “adapt the test to match the code’s current behavior,” but that would probably trigger the monitor. This is sort of like steganography, but not the completely hidden, multi-step steganographic reasoning I was thinking of.
I still think using the word “depurate” is pretty weird, but it was only a one-time event. It would be very nice to get more transcripts from these experiments. Then we could, e.g., check whether rare words show up more often in process-supervised CoTs than in outcome-supervised CoTs.
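As a rough sketch of that rare-word check, assuming the transcripts are available as plain strings and using the `wordfreq` package’s Zipf-scale word frequencies as the rarity measure:

```python
import re

from wordfreq import zipf_frequency  # pip install wordfreq


def rare_word_rate(text: str, zipf_threshold: float = 3.0) -> float:
    """Fraction of alphabetic tokens whose English Zipf frequency is below
    the threshold (lower Zipf score = rarer word)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    rare = sum(1 for t in tokens if zipf_frequency(t, "en") < zipf_threshold)
    return rare / len(tokens)


# Hypothetical usage, given lists of CoT transcripts from each training setup:
# mean = lambda cots: sum(map(rare_word_rate, cots)) / len(cots)
# print(mean(process_supervised_cots), mean(outcome_supervised_cots))
```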
It’s also worth experimenting to find out whether the CoT is load-bearing. You could check whether reward hacking decreases when you:
- Rerun the experiment with no CoT at all
- Take the model checkpoint from the previous process-supervised CoT experiment and train to convergence with no CoT